Is the function nnet suited to neural-network prediction on text?

I am trying to build a neural network model to predict the sentiment of a text.
I have a data frame BDD with 3 columns:
1) adjectives, in the column "ADJ"
2) nouns, in the column "NOUN"
3) the label "V3", which is 1 if adj+noun is positive, -1 if negative, and 0 if neutral
BDD is split into BDD_train and BDD_test. I managed to fit my model, but the problem appears when I make predictions.
This is my code, followed by the training output on BDD_train:
nn_model <- nnet(V3 ~ ., data = BDD_train, size = 5, decay = 1, linout = TRUE, maxit = 500)
pred2 <- predict(nn_model, newdata = BDD_test)
# weights: 476
initial value 133.618585
iter 10 value 18.334323
iter 20 value 15.922404
iter 30 value 15.823113
iter 40 value 15.816471
iter 50 value 15.811491
iter 60 value 15.811371
iter 60 value 15.811371
iter 60 value 15.811371
final value 15.811371
converged
My error message is:
"Error in model.frame.default(Terms, newdata, na.action = na.omit, xlev = object$xlevels) :
the factor ADJ has new levels accommodative, Annual, Apparent, aware, definitive, federal, Good, greater, healthy, Monthly, more"


ORTOOLS - CPSAT - Objective to minimize a value by intervals

In my OR-Tools CP-SAT model, I am computing a variable called salary_var (among others). I need to minimize an objective; let's call it "taxes".
To compute the taxes, the formula is not linear but organised this way:
if salary_var is below 10084, the tax rate is 0%
between 10085 and 25710, the rate is 11%
between 25711 and 73516, the rate is 30%
and 41% above that
For example, if salary_var is 30000, the taxes are:
(25710 - 10085) * 0.11 + (30000 - 25711) * 0.3 = 1718.75 + 1286.70 = 3005.45
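For reference, a plain-Python sketch of this bracket arithmetic outside any solver (the helper name taxes_of is made up here):
def taxes_of(salary):
    # 0% up to 10084, 11% up to 25710, 30% up to 73516, 41% above
    t = 0.0
    t += max(0, min(salary, 25710) - 10085) * 0.11
    t += max(0, min(salary, 73516) - 25711) * 0.30
    t += max(0, salary - 73516) * 0.41
    return t

print(taxes_of(30000))  # 1718.75 + 1286.7 = 3005.45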
My question: how can I efficiently code my "taxes" objective?
Thanks for your help
Seb
This task looks rather strange: there is not much context, and some parts of it might touch not-so-nice areas of finite-domain solvers (large domains, or scaling / division during solving).
Therefore: consider this an idea / template!
Code
from ortools.sat.python import cp_model

# Data
INPUT = 30000
INPUT_UB = 1000000
TAX_A = 11
TAX_B = 30
TAX_C = 41

# Helpers
# new variable which is constrained to be equal to: given input-var MINUS constant
# can get negative / wrap-around
def aux_var_offset(model, var, offset):
    aux_var = model.NewIntVar(-INPUT_UB, INPUT_UB, "")
    model.Add(aux_var == var - offset)
    return aux_var

# new variable which is equal to the given input-var IFF >= 0; else 0
def aux_var_nonnegative(model, var):
    aux_var = model.NewIntVar(0, INPUT_UB, "")
    model.AddMaxEquality(aux_var, [var, model.NewConstant(0)])
    return aux_var

# Model
model = cp_model.CpModel()

# vars
salary_var = model.NewIntVar(0, INPUT_UB, "salary")
tax_component_a = model.NewIntVar(0, INPUT_UB, "tax_11")
tax_component_b = model.NewIntVar(0, INPUT_UB, "tax_30")
tax_component_c = model.NewIntVar(0, INPUT_UB, "tax_41")

# constraints
model.AddMinEquality(tax_component_a, [
    aux_var_nonnegative(model, aux_var_offset(model, salary_var, 10085)),
    model.NewConstant(25710 - 10085)])
model.AddMinEquality(tax_component_b, [
    aux_var_nonnegative(model, aux_var_offset(model, salary_var, 25711)),
    model.NewConstant(73516 - 25711)])
model.Add(tax_component_c == aux_var_nonnegative(model,
    aux_var_offset(model, salary_var, 73516)))

tax_full_scaled = tax_component_a * TAX_A + tax_component_b * TAX_B + tax_component_c * TAX_C

# Demo
model.Add(salary_var == INPUT)
solver = cp_model.CpSolver()
status = solver.Solve(model)
print(list(map(lambda x: solver.Value(x), [tax_component_a, tax_component_b, tax_component_c, tax_full_scaled])))
Output
[15625, 4289, 0, 300545]
Remarks
As implemented:
uses scaled solving (the tax rates are kept as integer percentages)
produces a scaled solution (300545, i.e. 3005.45 after dividing by 100)
no fiddling with non-integral / ratio / rounding stuff, BUT large domains
Alternative:
Maybe something around AddDivisionEquality
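One possible sketch of that alternative (an untested assumption: the scaled sum is first bound to an intermediate variable so that AddDivisionEquality gets a plain variable; the variable names are made up, and the integer division truncates):
# bind the scaled linear expression to a variable, then divide by 100
tax_scaled_var = model.NewIntVar(0, 100 * INPUT_UB, "tax_scaled")
model.Add(tax_scaled_var == tax_full_scaled)
tax_unscaled = model.NewIntVar(0, INPUT_UB, "tax_unscaled")
model.AddDivisionEquality(tax_unscaled, tax_scaled_var, 100)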
Edit regarding Laurent's comments
In some scenarios it might make sense to solve the scaled problem while still being able to reason about the real, unscaled values more easily.
If I interpret the comment correctly, the following is a demo (which I was not aware of, and it's cool!):
Updated Demo Code (partial)
# Demo -> attempt at demonstrating the objective-scaling suggestion
model.Add(salary_var >= 30000)
model.Add(salary_var <= 40000)
model.Minimize(salary_var)
model.Proto().objective.scaling_factor = 0.001 # DEFINE INVERSE SCALING
solver = cp_model.CpSolver()
solver.parameters.log_search_progress = True # SCALED BACK OBJECTIVE PROGRESS
status = solver.Solve(model)
print(list(map(lambda x: solver.Value(x), [tax_component_a, tax_component_b, tax_component_c, tax_full_scaled])))
print(solver.ObjectiveValue()) # SCALED BACK OBJECTIVE
Output (excerpt)
...
...
#1 0.00s best:30 next:[30,29.999] fixed_bools:0/1
#Done 0.00s
CpSolverResponse summary:
status: OPTIMAL
objective: 30
best_bound: 30
booleans: 1
conflicts: 0
branches: 1
propagations: 0
integer_propagations: 2
restarts: 1
lp_iterations: 0
walltime: 0.0039022
usertime: 0.0039023
deterministic_time: 8e-08
primal_integral: 1.91832e-07
[15625, 4289, 0, 300545]
30.0

SSP Algorithm minimal subset of length k

Suppose S is a set with t elements modulo n. There are 2^t subsets in total. Write a PARI/GP program which finds the smallest subset U (in terms of length) of distinct elements such that the sum of all elements in U is 0 modulo n. It is easy to write a program which searches via brute force, but brute force becomes infeasible as t and n get larger, so I would appreciate help writing a program which does not use brute force to solve this instance of the subset sum problem.
Dynamic programming approach:
def isSubsetSum(st, n, sm):
    # The value of subset[i][j] will be
    # true if there is a subset of
    # set[0..i-1] with sum equal to j
    # (a shared-row initialisation like [[True] * (sm+1)] * (n+1)
    # would alias every row, so build the rows with a comprehension)
    subset = [[False] * (sm + 1) for _ in range(n + 1)]

    # If sum is 0, then answer is true
    for i in range(0, n + 1):
        subset[i][0] = True

    # If sum is not 0 and set is empty,
    # then answer is false
    for i in range(1, sm + 1):
        subset[0][i] = False

    # Fill the subset table in bottom-up manner
    for i in range(1, n + 1):
        for j in range(1, sm + 1):
            if j < st[i - 1]:
                subset[i][j] = subset[i - 1][j]
            else:
                subset[i][j] = subset[i - 1][j] or subset[i - 1][j - st[i - 1]]

    """uncomment this code to print the table
    for i in range(0, n + 1):
        for j in range(0, sm + 1):
            print(subset[i][j], end="")
        print(" ")"""

    return subset[n][sm]
I got this code from here; I don't know whether it actually works.
function getSummingItems(a, t) {
  return a.reduce((h, n) => Object.keys(h)
    .reduceRight((m, k) => +k + n <= t ? (m[+k + n] = m[+k + n] ? m[+k + n].concat(m[k].map(sa => sa.concat(n)))
                                                                : m[k].map(sa => sa.concat(n)), m)
                                       : m, h), {0: [[]]})[t];
}

var arr = Array(20).fill().map((_, i) => i + 1), // [1,2,...,20]
    tgt = 42,
    res = [];

console.time("test");
res = getSummingItems(arr, tgt);
console.timeEnd("test");
console.log("found", res.length, "subsequences summing to", tgt);
console.log(JSON.stringify(res));
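Note that neither snippet addresses the modular, minimum-size version asked about. As a sketch of how to avoid brute force there (plain Python rather than PARI/GP; min_zero_subset is a name made up here), one can run a dynamic program over the n residues, keeping one smallest subset per residue:
def min_zero_subset(S, n):
    # dp maps residue -> smallest known non-empty subset (a tuple) whose
    # sum is that residue mod n; work is O(len(S) * n) instead of 2^len(S)
    dp = {}
    for x in S:
        new = dict(dp)
        # extend every residue reachable from earlier elements by x,
        # and also start the singleton subset (x,) from residue 0
        for r, sub in list(dp.items()) + [(0, ())]:
            r2 = (r + x) % n
            cand = sub + (x,)
            if r2 not in new or len(cand) < len(new[r2]):
                new[r2] = cand
        dp = new
    return dp.get(0)  # None if no subset sums to 0 mod n

print(min_zero_subset([3, 5, 9, 14], 11))  # -> (3, 5, 14), since 22 = 2 * 11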

How can I sum functions that are built from elements of an imported dataset?

See the code and error below. I have already tried Do, For, ... and it is not working.
CODE + Error from Mathematica:
Import of survival probabilities _{k}p_x and _{k}p_y (calculated in Excel):
px = Import["C:\\Users\\Eva\\Desktop\\kpx.xlsx"];
px = Flatten[Take[px, All], 1];
NOTE: The probability _{k}p_x can be found at position px[[k+2, x-16]].
i = 0.04;
v = 1/(1 + i);
JointLifeIndep[x_, y_, n_] = Sum[v^k*px[[k + 2, x - 16]]*py[[k + 2, y - 16]], {k , 0, n - 1}]
Part::pkspec1: The expression 2+k cannot be used as a part specification.
Part::pkspec1: The expression 2+k cannot be used as a part specification.
Part::pkspec1: The expression 2+k cannot be used as a part specification.
General::stop: Further output of Part::pkspec1 will be suppressed during this calculation.
Part of the dataset (top-left corner):
k\x 18 19 20
0 1 1 1
1 0.999478086278185 0.999363078716059 0.99927911905056
2 0.998841497412202 0.998642656911039 0.99858030519133
3 0.998121451605207 0.99794428814123 0.99788275311401
4 0.997423447323642 0.997247180349674 0.997174407432264
5 0.996726703362208 0.996539285828369 0.996437857252448
6 0.996019178300768 0.995803204773039 0.99563600297737
7 0.995283481416241 0.995001861216016 0.994823584922968
8 0.994482556091416 0.994189960607964 0.99405569519175
9 0.993671079225432 0.99342255996206 0.993339856748282
10 0.992904079096455 0.992707177451333 0.992611817294026
11 0.992189069953677 0.9919796017009 0.991832027835091
Without having the exact same data files to work with, it is often easy for each of us to make mistakes that the other cannot reproduce or understand.
From your snapshot of the data set, I used Export in Mathematica to try to reproduce your .xlsx file. Then I tried the following:
px = Import["kpx.xlsx"];
px = Flatten[Take[px, All], 1];
py = px; (* fake some py data *)
i = 0.04;
v = 1/(1 + i);
JointLifeIndep[x_, y_, n_] := Sum[v^k*px[[k+2,x-16]]*py[[k+2,y-16]], {k,0,n-1}];
JointLifeIndep[17, 17, 12]
and it displays 362.402
Notice I used := instead of = in my definition of JointLifeIndep. := and = do different things in Mathematica: = immediately evaluates the right-hand side, so px[[k+2, x-16]] is extracted while k is still a symbol, and Part then complains that 2+k cannot be used as a part specification, which is possibly the reason you are getting the error that you do.
You should also be careful with your subscript values and make sure that every subscript is between 1 and the number of rows (or columns) in your matrix.
So see if you can try this example with an Excel sheet containing only the snapshot of data that you showed and see if you get the same result that I do.
Hopefully that will be enough for you to make progress.

Clarification between Epoch and iteration

This answer points to the difference between an epoch and an iteration while training a neural network. However, when I look at the source code for the solver API in the Stanford CS231n course (and I'm assuming this is the case for most libraries out there as well), during each iteration, batch_size examples are randomly selected with replacement. Thus, there is no guarantee that all examples are seen during each epoch, is there?
Does an epoch then mean that all examples are seen in expectation? Or am I understanding this wrong?
Relevant Source Code:
def _step(self):
    """
    Make a single gradient update. This is called by train() and should not
    be called manually.
    """
    # Make a minibatch of training data
    num_train = self.X_train.shape[0]
    batch_mask = np.random.choice(num_train, self.batch_size)
    X_batch = self.X_train[batch_mask]
    y_batch = self.y_train[batch_mask]

    # Compute loss and gradient
    loss, grads = self.model.loss(X_batch, y_batch)
    self.loss_history.append(loss)

    # Perform a parameter update
    for p, w in self.model.params.iteritems():
        dw = grads[p]
        config = self.optim_configs[p]
        next_w, next_config = self.update_rule(w, dw, config)
        self.model.params[p] = next_w
        self.optim_configs[p] = next_config

def train(self):
    """
    Run optimization to train the model.
    """
    num_train = self.X_train.shape[0]
    iterations_per_epoch = max(num_train / self.batch_size, 1)
    num_iterations = self.num_epochs * iterations_per_epoch

    for t in xrange(num_iterations):
        self._step()

        # Maybe print training loss
        if self.verbose and t % self.print_every == 0:
            print '(Iteration %d / %d) loss: %f' % (
                t + 1, num_iterations, self.loss_history[-1])

        # At the end of every epoch, increment the epoch counter and decay the
        # learning rate.
        epoch_end = (t + 1) % iterations_per_epoch == 0
        if epoch_end:
            self.epoch += 1
            for k in self.optim_configs:
                self.optim_configs[k]['learning_rate'] *= self.lr_decay

        # Check train and val accuracy on the first iteration, the last
        # iteration, and at the end of each epoch.
        first_it = (t == 0)
        last_it = (t == num_iterations - 1)
        if first_it or last_it or epoch_end:
            train_acc = self.check_accuracy(self.X_train, self.y_train,
                                            num_samples=1000)
            val_acc = self.check_accuracy(self.X_val, self.y_val)
            self.train_acc_history.append(train_acc)
            self.val_acc_history.append(val_acc)

            if self.verbose:
                print '(Epoch %d / %d) train acc: %f; val_acc: %f' % (
                    self.epoch, self.num_epochs, train_acc, val_acc)

            # Keep track of the best model
            if val_acc > self.best_val_acc:
                self.best_val_acc = val_acc
                self.best_params = {}
                for k, v in self.model.params.iteritems():
                    self.best_params[k] = v.copy()

    # At the end of training swap the best params into the model
    self.model.params = self.best_params
Thanks.
I believe, as you say, that in the Stanford course they are effectively using "epoch" with the less strict meaning of "expected number of times each example is seen during training". However, in my experience, most implementations consider an epoch as running through every example in the training set once, and I'd say they only chose the sampling with replacement for simplicity. If you have a good amount of data, chances are that you will not see a difference, but still, it is more correct to sample without replacement until there are no more examples.
You can check, for example, how Keras does the training in its source code; it's a bit complicated, but the important point is that make_batches is called to split the (possibly shuffled) examples into batches, which matches your initial idea of "epoch".
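For illustration, a minimal sketch of the without-replacement convention (plain NumPy; epoch_batches is a made-up helper, not part of the course code): shuffle the indices once per epoch and slice them into consecutive batches.
import numpy as np

def epoch_batches(num_train, batch_size, rng=np.random):
    # one epoch: every index appears exactly once, in random order
    perm = rng.permutation(num_train)
    for start in range(0, num_train, batch_size):
        yield perm[start:start + batch_size]

# usage: each example is seen exactly once per epoch
for batch_mask in epoch_batches(num_train=1000, batch_size=128):
    pass  # X_batch = X_train[batch_mask]; y_batch = y_train[batch_mask]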

PySpark : how to split data without randomnize

There is a function that can randomly split data:
trainingRDD, validationRDD, testRDD = RDD.randomSplit([6, 2, 2], seed=0L)
I'm curious whether there is a way to generate the same partitioning (train 60 / valid 20 / test 20) but without randomizing; let's just say the current order of the data is used, so the first 60% become the train set, the next 20% the validation set, and the last 20% the test set.
Is there a way to split the data like randomSplit does, but without randomizing?
The basic issue here is that unless you have an index column in your data, there is no concept of "first rows" and "next rows" in your RDD; it's just an unordered set. If you have an integer index column you could do something like this (the residues of % 5 run from 0 to 4, so three of them give 60% and one each gives 20%):
train = RDD.filter(lambda r: r['index'] % 5 <= 2)
validation = RDD.filter(lambda r: r['index'] % 5 == 3)
test = RDD.filter(lambda r: r['index'] % 5 == 4)
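If there is no index column yet, one possible sketch (assuming a plain RDD; the thresholds come from a count, and zipWithIndex orders elements by partition and position within each partition) is to attach positions and filter on them:
n = RDD.count()
indexed = RDD.zipWithIndex()  # pairs (row, position)

train = indexed.filter(lambda ri: ri[1] < 0.6 * n).map(lambda ri: ri[0])
validation = indexed.filter(lambda ri: 0.6 * n <= ri[1] < 0.8 * n).map(lambda ri: ri[0])
test = indexed.filter(lambda ri: ri[1] >= 0.8 * n).map(lambda ri: ri[0])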