assigning a destination frame to a prediction in h2o.ai - variable-assignment

I like h2o.ai for machine learning using R.
https://cran.r-project.org/web/packages/h2o/h2o.pdf
I like random forests, but I'm making a few thousand predictions in a loop.
It is spamming up my memory with a new prediction frame for every call.
I can't afford to keep them all in memory. I'm making my very nice computer work very hard, and it doesn't have the capacity to hold all the balls in the air at once.
If I could assign a destination frame name to the prediction then each new one would overwrite the old ones.
How do I assign a destination frame name when I am performing "h2o.predict" on an object?
Things that I have tried that did not work:
h2o.predict(object = rf.hex, newdata = test.hex, predictions_frame = "predict.hex")
h2o.predict(object = rf.hex, newdata = test.hex, destination_frame = "predict.hex")
h2o.predict(object = rf.hex, newdata = test.hex, model_id = "predict.hex")

There is no way that I am aware of.
But as an alternative, inside your loop, you could call h2o.rm() on the return value from h2o.predict(). It is worth calling h2o.gc() as well. Something like:
for (data in alldata) {
  # ... prepare newdata
  p = h2o.predict(model, newdata)
  # ... do something with p here
  h2o.rm(p)
  h2o.rm(newdata)  # if also not needed any more
  h2o.gc()
}
Aside: you said "I'm making a few thousand predictions in a loop". Assuming they are all against the same model, remember you can batch them up and make all few thousand predictions with a single newdata frame. One call to h2o.predict() with 1000 rows is much more efficient than making 1000 h2o.predict() calls with one row each.
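For instance, a minimal sketch (make_newdata() is a hypothetical helper standing in for however you prepare each item as a plain data.frame; it assumes the rows can be stacked client-side and uploaded once with as.h2o()):
library(h2o)

# prepare everything client-side, upload once, predict once
batch.df  <- do.call(rbind, lapply(alldata, make_newdata))   # make_newdata() is a placeholder
batch.hex <- as.h2o(batch.df, destination_frame = "batch.hex")
all.preds <- h2o.predict(rf.hex, newdata = batch.hex)        # one prediction frame for all rows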

Related

CPU memory allocation failed in Dynet

I am not sure why I am running out of memory. Take Goldberg's parser; all I do is change this line:
scores, exprs = self.__evaluate(conll_sentence, True)
and add a for loop around it to repeat it K times:
for k in xrange(K):
    scores, exprs = self.__evaluate(conll_sentence, True)
    # do something
Then, in getExpr, I do the following:
samples_out = np.random.normal(0, 0.001, (1, self.hidden_units))
samples_FOH = np.random.normal(0, 0.001, (self.hidden_units, self.ldims * 2))
samples_FOM = np.random.normal(0, 0.001, (self.hidden_units, self.ldims * 2))
samples_Bias = np.random.normal(0, 0.001, (self.hidden_units))
XoutLayer = self.outLayer.expr() + inputTensor(samples_out)
XhidLayerFOH = self.hidLayerFOH.expr() + inputTensor(samples_FOH)
XhidLayerFOM = self.hidLayerFOM.expr() + inputTensor(samples_FOM)
XhidBias = self.hidBias.expr() + inputTensor(samples_Bias)
if sentence[i].headfov is None:
    sentence[i].headfov = XhidLayerFOH * concatenate([sentence[i].lstms[0], sentence[i].lstms[1]])
if sentence[j].modfov is None:
    sentence[j].modfov = XhidLayerFOM * concatenate([sentence[j].lstms[0], sentence[j].lstms[1]])
output = XoutLayer * self.activation(sentence[i].headfov + sentence[j].modfov + XhidBias)
return output
Essentially, what happens in the above block is that normally distributed noise is first generated and then added to the trained values. But it seems that, somewhere along the way, all the generated values stay in memory and it just runs out. Does anyone know why?
Dynet expressions stay in memory until the next call to renew_cg().
So the fix would be to call it after each iteration of your loop, provided that you have retrieved all information you needed from the computation graph.
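A minimal sketch of that pattern, assuming the Python dynet bindings (build_graph and results are placeholders for your own code; dy.renew_cg() clears the current computation graph):
import dynet as dy

results = []
for k in xrange(K):
    scores_expr = build_graph(conll_sentence)  # stand-in for self.__evaluate(conll_sentence, True)
    results.append(scores_expr.value())        # pull plain numbers out of the graph first
    dy.renew_cg()                              # then drop every expression built in this iteration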
Side note: when you do a simple addition, such as:
XoutLayer = self.outLayer.expr()+inputTensor(samples_out)
no addition is actually performed. You just create a new expression and specify how to evaluate it from other expressions. The actual computation is performed when there is a call to .forward() (or .value(), etc) on XoutLayer or on an expression whose computation depends on XoutLayer.
So, dynet needs to allocate memory for all expressions in the current computation graph.
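For example, assuming the Python dynet bindings, nothing is computed here until the final value call:
import dynet as dy
import numpy as np

dy.renew_cg()
a = dy.inputTensor(np.random.normal(0, 0.001, (1, 3)))
b = dy.inputTensor(np.random.normal(0, 0.001, (1, 3)))
c = a + b            # no computation yet: just a new node in the graph
print(c.npvalue())   # the forward pass happens here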

"Appending" to an ArraySlice?

Say ...
you have about 20 Things
very often, you do a complex calculation running through a loop of, say, 1000 items. The end result is a varying number of Things, around 20, each time
you don't know how many there will be until you run through the whole loop
you then want to quickly (and of course elegantly!) access the result set in many places
for performance reasons you don't want to just make a new array each time. Note that unfortunately there's a differing number each time, so you can't trivially reuse the same array.
What about ...
var thingsBacking = [Thing](repeating: Thing(), count: 100) // hard limit!
var things: ArraySlice<Thing> = []
func fatCalculation() {
    var pin: Int = 0
    // happily, no need to clean out thingsBacking
    for c in .. some huge loop {
        // ... only some of the items (roughly 20, say) become the result
        // x = .. one of the result items
        thingsBacking[pin] = Thing(... x, y, z )
        pin += 1
    }
    // and then, magic of slices ...
    things = thingsBacking[0..<pin]
}
(Then, you can do this anywhere... for t in things { .. } )
What I am wondering: is there something you can call on an ArraySlice<Thing> to do that in one step - to "append to" an ArraySlice and avoid having to bother setting the length at the end?
So, something like this ..
things = ... set it to zero length
things.quasiAppend(x)
things.quasiAppend(x2)
things.quasiAppend(x3)
With no further effort, things now has a length of three and indeed the three items are already in the backing array.
I'm particularly interested in performance here (unusually!)
Another approach,
var thingsBacking = [Thing?](repeating: Thing(), count: 100) // hard limit!
and just set the first one after your data to nil as an end-marker. Again, you don't have to waste time zeroing. But the end marker is a nuisance.
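For concreteness, a rough sketch of that end-marker idea (Thing, the input and the filter condition are placeholders):
struct Thing { var x = 0.0 }                                  // placeholder type

var thingsBacking = [Thing?](repeating: Thing(), count: 100)  // hard limit!

func fatCalculation(input: [Double]) {
    var pin = 0
    for v in input where v > 0.5 {        // stand-in for the real filter
        thingsBacking[pin] = Thing(x: v)
        pin += 1
    }
    thingsBacking[pin] = nil              // end marker (assumes fewer results than the limit)
}

func useThings() {
    var i = 0
    while let t = thingsBacking[i] {      // stop at the first nil
        _ = t                             // ... use t ...
        i += 1
    }
}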
Is there a better way to solve this particular type of array-performance problem?
Based on MartinR's comments, it would seem that for this problem, where
the data points are incoming, and
you don't know how many there will be until the last one (always fewer than some limit), and
you're having to redo the whole thing at high Hz,
it is best to just:
(1) set up the array
var ra = [Thing](repeating: Thing(), count: 100) // hard limit!
(2) at the start of each run,
.removeAll(keepingCapacity: true)
(3) just go ahead and .append each one.
(4) you don't have to especially mark the end or set a length once finished.
It seems it will indeed then use the same array backing. And it of course "increases the length" as it were each time you append - and you can iterate happily at any time.
Slices - get lost!
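Putting the recipe together, a minimal sketch (Thing, the input and the filter condition are placeholders):
struct Thing { var x = 0.0 }                       // placeholder type

// (1) set up the array once
var ra = [Thing](repeating: Thing(), count: 100)   // hard limit!

func fatCalculation(input: [Double]) {
    // (2) at the start of each run
    ra.removeAll(keepingCapacity: true)            // count goes to 0, storage is kept
    // (3) just append each result as it is found
    for v in input where v > 0.5 {                 // stand-in for the real filter
        ra.append(Thing(x: v))
    }
    // (4) nothing to mark or trim afterwards: ra.count is the result count
}

// iterate anywhere, any time
for t in ra { _ = t }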

XMeans ELKI fails at every third input file

I'm trying to cluster image data (stored in 100 separate CSV files) with ELKI's XMeans algorithm. It works well for the first two files, but then the algorithm hangs forever while processing the third file. The problem seems to occur at roughly every third file: when I start the loop over all files at the fourth file, it works for the fourth and fifth files but not for the sixth. The same goes for the 9th and 11th files... but maybe that's a coincidence.
My XMeans call looks like this:
DatabaseConnection dbc = new ArrayAdapterDatabaseConnection(data);
Database db = new StaticArrayDatabase(dbc, null);
db.initialize();
Relation<NumberVector> rel = db.getRelation(TypeUtil.NUMBER_VECTOR_FIELD);
DBIDRange ids = (DBIDRange) rel.getDBIDs();
SquaredEuclideanDistanceFunction dist = SquaredEuclideanDistanceFunction.STATIC;
RandomlyGeneratedInitialMeans init = new RandomlyGeneratedInitialMeans(RandomFactory.DEFAULT);
KMeansInitialization initializer = new FirstKInitialMeans();
PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans(data);
KMeansQualityMeasure informationCriterion = new WithinClusterMeanDistanceQualityMeasure();
RandomFactory random = new RandomFactory(123);
KMeans<NumberVector, KMeansModel> innerKMeans = new KMeansHamerly<>(dist, 50, 1, init, true);
XMeans<NumberVector, KMeansModel> xm = new XMeans<>(dist, 5, 50, 1, innerKMeans, initializer, splitInitializer, informationCriterion, random);
Clustering<KMeansModel> c = xm.run(db, rel);
I'm not too sure about these four lines, so maybe that's why it works for some files and for others it doesn't:
KMeansInitialization initializer = new FirstKInitialMeans();
PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans(data);
KMeansQualityMeasure informationCriterion = new WithinClusterMeanDistanceQualityMeasure();
RandomFactory random = new RandomFactory(123);
data is just a double[][] which contains the data from the input files.
Any help would be much appreciated!
Please, use the Parameterization API to configure X-means.
Because of the nested k-means, it is very easy to configure things badly.
The initializer of the inner k-means class must be set to this:
PredefinedInitialMeans splitInitializer = new PredefinedInitialMeans((double[][]) null);
KMeans<NumberVector, KMeansModel> innerKMeans = new KMeansHamerly<>(dist, 50, 1, splitInitializer, true);
because otherwise X-means currently cannot control the initialization of the inner algorithm. I will remove this parameter, and have XMeans set the initializer of the inner algorithm.
Without a stack trace (as mentioned by @Anony-Mousse) it is hard to say what is happening. My best guess is that this meta-algorithm (an algorithm that runs another algorithm!) is not correctly configured and maybe chooses bad initial values?

Is there a way to change directory in Modelica/Dymola automatically?

I have the following problem:
I have over 20 different models which I want to simulate one after another but I want to change the simulation directory each time.
Right now I'm manually changing the directory after each simulation (from ./ModelOne to ./ModelTwo), and I'd like to know if there's a way to change it automatically when I initialize or translate the new model.
Regards
Nev
The best way is to write a script, I think:
pathOfSave = {"E:\\work\\modelica\\SimulationResult\\Model1\\",
              "E:\\work\\modelica\\SimulationResult\\Model2\\"};
nbSim = 2;
nbOfPoi = 500; // number of output intervals; pick whatever suits your models
pathOfMod = {"MyModel.",
             "MyModel."};
modelsToSimulate = {"Model1",
                    "Model2"};
//If equdistant=true: ensure that the same number of data points is written in all result files;
//storing variables at events is disabled.
//Keep in the plot memory the last nbSim results.
experimentSetupOutput(equdistant=false, events=false);
for i in 1:nbSim loop
  //delete the result file if it already exists
  Modelica.Utilities.Files.removeFile(pathOfSave[i] + modelsToSimulate[i]);
  //translate the model
  translateModel(pathOfMod[i] + modelsToSimulate[i]);
  //simulate
  simulateModel(
    pathOfMod[i] + modelsToSimulate[i],
    method="dassl",
    stopTime=186350,
    numberOfIntervals=nbOfPoi,
    resultFile=pathOfSave[i] + modelsToSimulate[i]);
end for;
You can also put the command cd("mynewpath") in the initial algorithm section, if you want it to be attached to the model.
model example
  Real variable;
protected
  parameter String currDir = Modelica.Utilities.System.getWorkDirectory();
initial algorithm
  cd("C:\\Users\\xxx\\Documents\\Dymola\\MyModelFolder");
equation
  variable = time;
  when terminal() then
    cd(currDir);
  end when;
end example;
In any case, you can find all Dymola commands in the manual (volume one) under the section "builtin commands".
I hope this helps,
Marco

loop through structures and finding the correlation

The following example resembles the problem that I'm dealing with; although the code below is merely an example, it is structured in the same format as my actual data set.
clear all
England = struct('AirT',rand(320,1),'SolRad',rand(320,1),'Rain',rand(320,1));
Wales = struct('AirT',rand(320,1),'SolRad',rand(320,1),'Rain',rand(320,1));
Ireland = struct('AirT',rand(320,1),'SolRad',rand(320,1),'Rain',rand(320,1));
Scotland = struct('AirT',rand(320,1),'SolRad',rand(320,1),'Rain',rand(320,1));
Location = struct('England',England,'Wales', Wales, 'Ireland',Ireland,'Scotland',Scotland);
FieldName={'England','Wales','Scotland','Ireland'};
Data = {England.AirT,Wales.AirT,Scotland.AirT,Ireland.AirT};
Data = [FieldName;Data];
Data = struct(Data{:});
Data = cell2mat(struct2cell(Data)');
[R,P] = corrcoef(Data,'rows','pairwise');
R_Value= [FieldName(nchoosek(1:size(R,1),2)) num2cell(nonzeros(tril(R,-1)))];
So, this script shows the correlation between pairs of air temperature (AirT) values at the 4 locations. I'm looking for a way of also looking at the correlation between 'SolRad' and 'Rain' between the locations (same process as for AirT), or any other variables denoted in the structure. I could do this by replacing the inputs into 'Data', but that seems rather long-winded, especially when many different variables are involved. Any ideas on how to do this? I've tried using a loop, but it seems harder than I thought to get the data into the same format as in the example.
Let's see if this helps, or is what you are thinking:
clear all
England = struct('AirT',rand(320,1),'SolRad',rand(320,1),'Rain',rand(320,1));
Wales = struct('AirT',rand(320,1),'SolRad',rand(320,1),'Rain',rand(320,1));
Ireland = struct('AirT',rand(320,1),'SolRad',rand(320,1),'Rain',rand(320,1));
Scotland = struct('AirT',rand(320,1),'SolRad',rand(320,1),'Rain',rand(320,1));
Location = struct('England',England,'Wales', Wales, 'Ireland',Ireland,'Scotland',Scotland);
% get all the location fields
FieldName = transpose(fieldnames(Location));
% get the variables recorded at the first location
CorrData = fieldnames(Location.(FieldName{1}));
% get variables which were stored at all locations(just to be safe,
% we know that they are all the same)
for ii=2:length(FieldName)
    CorrData = intersect(CorrData,fieldnames(Location.(FieldName{ii})));
end
% process each variable that was recorded
for ii=1:length(CorrData)
    Data = cell(1,length(FieldName));
    % get the variable data from each location and store in Data
    for jj=1:length(FieldName)
        Data{jj} = Location.(FieldName{jj}).(CorrData{ii});
    end
    % process the data
    Data = [FieldName;Data];
    Data = struct(Data{:});
    Data = cell2mat(struct2cell(Data)');
    [R,P] = corrcoef(Data,'rows','pairwise');
    R_Value = [FieldName(nchoosek(1:size(R,1),2)) num2cell(nonzeros(tril(R,-1)))];
    % display the data, sounds good right?
    fprintf(1,'Correlation for %s\n',CorrData{ii});
    for jj=1:size(R_Value,1)
        fprintf(1,'%s\t%s\t%f\n',R_Value{jj,1},R_Value{jj,2},R_Value{jj,3});
    end
end
Let me know if I misunderstood, or if this is more involved than what you were thinking. Thanks!
fieldnames(s) and dynamic field references are your friend.
What I would suggest is to make one structure in which 'name' is a field, and the other fields are whatever you'd like. Regardless of how you set up your structure s, you can use fn = fieldnames(s); to return a cell array of the fields. You can access the contents of your structure using these names by using parentheses around the variable containing the name.
fn = fieldnames(s);
for i=1:length(fn)
    disp([fn{i} ':' s.(fn{i})])
end
Whatever you do with the values is up to you, of course!