implications of using callr_function = NULL in targets package - targets-r-package

I was wondering what happens when callr_function = NULL?
Is it just issues with things maybe being in the environment/side effects?
Mainly wondering because I was passing quite large spatio-temporal arrays (0.5 to 5 gigs) and callr serialization via saveRDS is quite slow.
The two things I was thinking about were forking callr and dropping in a different save function, or just using callr_function = NULL.

Ordinarily, targets runs the pipeline in a fresh, reproducible external R session. callr_function = NULL just says to run the pipeline in the current R session. I only recommend this for debugging, because in serious use cases you could accidentally invalidate some targets based on changed data in your global environment. callr_function = NULL will probably not help with large memory usage. For that, I recommend selecting a more efficient storage format for your data, e.g. tar_target(..., format = "feather"). You could also try tar_option_set(memory = "transient", garbage_collection = TRUE) for better memory efficiency.
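For concreteness, a minimal _targets.R sketch combining both suggestions could look like the following (load_big_data(), summarize_data() and the target names are placeholders, and format = "feather" assumes the target returns a data frame):

# _targets.R
library(targets)
tar_option_set(memory = "transient", garbage_collection = TRUE)
list(
  # store the large object in feather format instead of the default RDS serialization
  tar_target(big_data, load_big_data(), format = "feather"),
  # downstream targets reload big_data on demand and release it afterwards
  tar_target(result, summarize_data(big_data))
)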

Related

Incremental SAT Solving: save solving instance - change model between runs

From my understanding, incremental SAT solving helps to evaluate different models that are quite close to each other.
I want to use this to evaluate a model and, if I change it later, re-evaluate it using the previous solution for a faster result. However, after looking into various SAT solvers (Sat4J, Minisat, mathsat5), it seems like they are only able to solve incrementally when all models are presented within one run.
I'm quite new to SAT solving so I might be overlooking something. Is there no way to save a solving instance for later use? Is all learning lost on closing the instance?
In incremental mode, you can feed the solver with new constraints.
Depending on the settings, the solver may or may not forget previous learned clauses and heuristics.
To fully take advantage of the incremental mode and discard previously entered constraints in the system, you need to use "assumptions", i.e. specific variables which will activate or disable the constraints in the solver.
See e.g. this discussion in minisat newsgroup: https://groups.google.com/forum/#!topic/minisat/ffXxBpqKh90
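To make the assumption trick concrete, here is a hedged Sat4J sketch (the clauses and variable numbering are invented for illustration): each removable clause is guarded by a fresh selector variable, and that selector is asserted or negated in the assumptions of each isSatisfiable call, so the clause can be switched on or off without rebuilding the solver.

import org.sat4j.core.VecInt;
import org.sat4j.minisat.SolverFactory;
import org.sat4j.specs.ContradictionException;
import org.sat4j.specs.ISolver;
import org.sat4j.specs.TimeoutException;

public class SelectorDemo {
    public static void main(String[] args) throws ContradictionException, TimeoutException {
        ISolver solver = SolverFactory.newDefault();
        solver.newVar(4);
        int s = 4; // selector variable guarding the removable clause

        solver.addClause(new VecInt(new int[] { 1, 2 }));      // permanent clause (x1 or x2)
        solver.addClause(new VecInt(new int[] { -1, 3, -s })); // removable clause (not x1 or x3), guarded by s

        // Assuming s: the guarded clause is active.
        boolean withClause = solver.isSatisfiable(new VecInt(new int[] { s }));

        // Assuming not s: the guarded clause is trivially satisfied, i.e. effectively removed,
        // while learned clauses and heuristics are kept for the next call.
        boolean withoutClause = solver.isSatisfiable(new VecInt(new int[] { -s }));

        System.out.println(withClause + " / " + withoutClause);
    }
}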
SAT4J provides a mechanism which allows you to feed the solver, then remove some of the clauses and add new ones for the next satisfiability check. Clauses to be removed need to be added to a ConstrGroup. Unfortunately, it is slightly more complicated than that, as unit clauses need special handling. It works roughly like this:
// solver is initialized with the clauses that are not to be removed
boolean satisfiable;
ConstrGroup group = new ConstrGroup();
IVecInt unit = new VecInt();
try {
    for (/* all clauses to be added and removed */) {
        if (/* unit clause */) {
            unit.push(/* variable from clause */);
        } else {
            group.add(solver.addClause(clause));
        }
    }
    satisfiable = solver.isSatisfiable(unit);
} catch (ContradictionException e) {
    satisfiable = false;
} finally {
    group.removeFrom(solver);
}
Unfortunately, the removal of clauses is implemented in a rather naive way and requires quadratic effort in the number of clauses to be removed.
While this solution works in FeatureIDE (see isSatisfiable(Node node) in https://github.com/FeatureIDE/FeatureIDE/blob/develop/plugins/de.ovgu.featureide.fm.core/src/org/prop4j/SatSolver.java), it is likely that there are way more performant solutions out there.
The other solution with assumptions does not work in our case, as we have millions of queries to a single SAT solver instance with up to 20,000 variables. Assumptions would increase the number of variables from 20 thousand to a million, which is unlikely to help.

How to have multiple instances of MATLAB save the same file simultaneously

I am currently writing code to run a series of time-consuming experiments using nodes on a Unix cluster. Each of these experiments takes over 3 days to run on a 12-core machine. When each experiment is done, I am hoping to have it save some data to a common file.
I have a slight issue in that I submit all of my experiments to the cluster at the same time and so they are likely to be saving to the same file at the same time as well.
I am wondering what will happen when multiple instances of MATLAB try to save the same file at the same time (error/crash/nothing). Whatever the outcome, could I work around it using a try/catch loop as follows:
n_tries = 0;
while n_tries < 10
    try
        save('common_file', 'data')
        n_tries = 10;
    catch
        wait_time = 60 * rand;
        pause(wait_time);
        n_tries = n_tries + 1;
    end
end
Don't.
All MATLAB functions are explicitly not safe to use in a multi-threading/multi-processing environment.
If you write to one MAT-file simultaneously from multiple MATLAB sessions, chances are good that either several variables will be missing (because, e.g., two sessions append to the same state of the file) or the whole file will get corrupted.
Save individual files and merge them in a post-processing step.
For such long simulation runs, don't aggregate your data automatically unless you have a reliable framework. There are several reasons:
Out-of-memory exceptions or similar while writing can destroy all previous results; this is especially likely when writing large amounts of data.
Coding errors can destroy previous results. Your code will overwrite at least the most recently added data in case of a collision.
Undetected errors in MEX functions, which randomly hit the MATLAB address space instead of causing a segmentation fault, can cause MATLAB to write garbage to your MAT-file and destroy previous results.
Use some unique naming pattern for the individual files, e.g. PC name + current date/time.
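For example, a minimal sketch of such a naming scheme (the variable data and the file-name prefix are placeholders; a job or task ID could be appended as well):

% each worker writes its results to its own uniquely named file
[~, host] = system('hostname');
stamp = datestr(now, 'yyyymmdd_HHMMSS');
fname = sprintf('results_%s_%s.mat', strtrim(host), stamp);
save(fname, 'data');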
You would be best served by having a single recorder task that does the file output, with the other tasks queuing their save information to it.
Don't forget that the output "file" you supply to MATLAB only has to be file-like, i.e. support the necessary methods.

Google Spreadsheet turn off autosave via script?

I'm fairly new to using Google Docs, but I have come to really appreciate it. The scripting is pretty easy to accomplish simple tasks, but I have come to realize a potential speed issue that is a little frustrating.
I've got a sheet that I use for my business to calculate the cost of certain materials on a jobsite. It works great, but was a little tedious to clear between jobs so I wrote a simple script to clear the ranges (defined by me and referenced by name) that I needed emptied.
Once again, it worked great. The only problem is that clearing a few ranges (seven) ends up taking about ten full seconds. I believe that this is because the spreadsheet is being saved after each range is cleared, which becomes time intensive.
What I'd like to do is test this theory by disabling autosave in the script, and then re-enabling it after the ranges have been cleared. I don't know if this is even possible because I haven't seen a function in the API to do it, but if it is I'd love to know about it.
Edit: this is the function I'm using as it stands. I've tried rewriting it a couple of times to be more concise and less API call intensive, but so far I haven't had any luck in reducing the time it takes to process the calls.
function clearSheet() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getActiveSheet();
  sheet.getRange("client").clear();
  sheet.getRange("lm_group_1").clear({contentsOnly:true});
  sheet.getRange("lm_group_2").clear({contentsOnly:true});
  sheet.getRange("dr_group_1").clear({contentsOnly:true});
  sheet.getRange("dr_group_2").clear({contentsOnly:true});
  sheet.getRange("fr_group_1").clear({contentsOnly:true});
  sheet.getRange("fr_group_2").clear({contentsOnly:true});
  sheet.getRange("gr_group_1").clear({contentsOnly:true});
  sheet.getRange("client_name").activate();
}
That is not possible, and probably never will be; it's just not the nature of Google Docs.
But depending on how you wrote your script, it's probable that all changes are already being written at once at the end. There are some API calls that may force a flush of your writes to the spreadsheet (like trying to read after you wrote something), but we'd need to see your code to check that.
Anyway, you can always check the spreadsheet revision history to verify whether it's being done at once or in multiple steps.
As for performance, Apps Script has a natural delay that is unavoidable, but it's not 10 seconds, so there's probably room to improve your script by using fewer API calls and preferring batch calls like setValues over setValue. But then again, we'd have to see your code to confirm that and give more helpful tips.
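For illustration only, here is a hedged sketch of how the clears in the question could be batched (it assumes all the named ranges live on the active sheet, and that getRangeList/RangeList are available, which were added to the Spreadsheet service after this question was asked):

function clearSheetBatched() {
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getActiveSheet();
  var names = ['lm_group_1', 'lm_group_2', 'dr_group_1', 'dr_group_2',
               'fr_group_1', 'fr_group_2', 'gr_group_1'];
  // resolve the named ranges once, then clear them in a single batched call
  var a1s = names.map(function (n) {
    return ss.getRangeByName(n).getA1Notation();
  });
  sheet.getRangeList(a1s).clearContent();
  ss.getRangeByName('client').clear();
  ss.getRangeByName('client_name').activate();
}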

mvc-mini-profiler slows down Entity Framework

I've set up mvc-mini-profiler against my Entity Framework-powered MVC 3 site. Everything is duly configured: starting profiling in Application_Start, ending it in Application_End, and so on. The profiling part works just fine.
However, when I try to swap my data model object generation to provide profilable versions, performance slows to a crawl. Not every SQL query is affected, but some queries take about 5x as long as the entire page load. (The very first page load after firing up IIS Express takes a bit longer, but this is sustained.)
Negligible time (~2ms tops) is spent querying, executing and "data reading" the SQL, while this line:
var person = dataContext.People.FirstOrDefault(p => p.PersonID == id);
...when wrapped in using(profiler.Step()) is recorded as taking 300-400 ms. I profiled with dotTrace, which confirmed that the time is actually spent in EF as usual (the profilable components do make very brief appearances), only it is taking much longer.
This leads me to believe that the connection or some of its constituent parts are missing sufficient data, making EF perform far worse.
This is what I'm using to make the context object (my edmx model's class is called DataContext):
var conn = ProfiledDbConnection.Get(
/* returns an SqlConnection */CreateConnection());
return CreateObjectContext<DataContext>(conn);
I originally used the mvc-mini-profiler provided ObjectContextUtils.CreateObjectContext method. I dove into it and noticed that it set a wildcard metadata workspace path string. Since I have the database layer isolated to one project and several MVC sites as other projects using the code, those paths have changed and I'd rather be more specific. Also, I thought this was the cause of the performance issue. I duplicated the CreateObjectContext functionality into my own project to provide this, as such:
public static T CreateObjectContext<T>(DbConnection connection) where T : System.Data.Objects.ObjectContext {
    var workspace = new System.Data.Metadata.Edm.MetadataWorkspace(
        GetMetadataPathsString().Split('|'),
        // ^-- returns
        // "res://*/Redacted.csdl|res://*/Redacted.ssdl|res://*/Redacted.msl"
        new Assembly[] { typeof(T).Assembly });
    // The remainder of the method is copied straight from the original,
    // and I carried over a duplicate CtorCache too to make this work.
    var factory = DbProviderServices.GetProviderFactory(connection);
    var itemCollection = workspace.GetItemCollection(System.Data.Metadata.Edm.DataSpace.SSpace);
    itemCollection.GetType().GetField("_providerFactory", // <==== big fat ugly hack
        BindingFlags.NonPublic | BindingFlags.Instance).SetValue(itemCollection, factory);
    var ec = new System.Data.EntityClient.EntityConnection(workspace, connection);
    return CtorCache<T, System.Data.EntityClient.EntityConnection>.Ctor(ec);
}
...but it doesn't seem to make much of a difference. The problem still exists whether I use the above hacked version that's more specific with metadata workspace paths or the mvc-mini-profiler provided version. I just thought I'd mention that I've tried this too.
Having exhausted all this, I'm at my wits' end. Once again: when I just provide my data context as usual, no performance is lost. When I provide a "profilable" data context, performance plummets for certain queries (I don't know what influences this either). What could mvc-mini-profiler do that's wrong? Am I still feeding it the wrong data?
I think this is the same problem as this person ran into.
I just resolved this issue today.
see: http://code.google.com/p/mvc-mini-profiler/issues/detail?id=43
It happened because some of our fancy hacks were not cached well enough. In particular:
var workspace = new System.Data.Metadata.Edm.MetadataWorkspace(
    new string[] { "res://*/" },
    new Assembly[] { typeof(T).Assembly });
This is a very expensive call, so we need to cache the workspace.
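For reference, a hedged sketch of what "cache the workspace" can look like on top of the code from the question (GetMetadataPathsString and CtorCache are the question's helpers; the static cache is illustrative rather than the library's actual implementation, and the provider-factory reflection hack is elided):

private static readonly System.Collections.Concurrent.ConcurrentDictionary<Type, System.Data.Metadata.Edm.MetadataWorkspace>
    workspaceCache = new System.Collections.Concurrent.ConcurrentDictionary<Type, System.Data.Metadata.Edm.MetadataWorkspace>();

public static T CreateObjectContext<T>(DbConnection connection) where T : System.Data.Objects.ObjectContext {
    // build the expensive MetadataWorkspace once per context type and reuse it
    var workspace = workspaceCache.GetOrAdd(typeof(T), t =>
        new System.Data.Metadata.Edm.MetadataWorkspace(
            GetMetadataPathsString().Split('|'),
            new Assembly[] { t.Assembly }));
    // ... provider-factory wiring as in the question ...
    var ec = new System.Data.EntityClient.EntityConnection(workspace, connection);
    return CtorCache<T, System.Data.EntityClient.EntityConnection>.Ctor(ec);
}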
Profiling, by definition, will affect the performance of the application being profiled. The profiler needs to insert its own method calls throughout the application, intercept low-level system calls, and record all that data someplace (meaning writes to disk). All of those tasks take up precious CPU cycles, memory, and disk access.

hosting simple python scripts in a container to handle concurrency, configuration, caching, etc

My first real-world Python project is to write a simple framework (or re-use/adapt an existing one) which can wrap small python scripts (which are used to gather custom data for a monitoring tool) with a "container" to handle boilerplate tasks like:
fetching a script's configuration from a file (keeping that info up to date if the file changes, and handling decryption of sensitive config data)
running multiple instances of the same script in different threads instead of spinning up a new process for each one
exposing an API for caching expensive data and storing persistent state from one script invocation to the next
Today, script authors must handle the issues above, which usually means that most script authors don't handle them correctly, causing bugs and performance problems. In addition to avoiding bugs, we want a solution which lowers the bar to create and maintain scripts, especially given that many script authors may not be trained programmers.
Below are examples of the API I've been thinking of, and which I'm looking to get your feedback about.
A scripter would need to build a single method which takes (as input) the configuration that the script needs to do its job, and either returns a python object or calls a method to stream back data in chunks. Optionally, a scripter could supply methods to handle startup and/or shutdown tasks.
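To make the mechanism concrete, here is a minimal, hypothetical sketch of the container side (the hook names init/run/shutdown follow the convention described here; nothing in it is an existing library):

import importlib

def run_script(module_name, config, context, cache):
    """Load a scripter's module and call its optional init/run/shutdown hooks."""
    script = importlib.import_module(module_name)
    init = getattr(script, "init", None)          # optional hook
    shutdown = getattr(script, "shutdown", None)  # optional hook
    if init:
        init(config, context, cache)
    try:
        return script.run(config, context, cache)  # required hook
    finally:
        if shutdown:
            shutdown(config, context, cache)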
HTTP-fetching script example (in pseudocode, omitting the actual data-fetching details to focus on the container's API):
def run(config, context, cache):
    results = http_library_call(config.url, config.http_method, config.username, config.password, ...)
    return {"html": results.html, "status_code": results.status, "headers": results.response_headers}
def init(config, context, cache):
    config.max_threads = 20                 # up to 20 URLs at one time (per process)
    config.max_processes = 3                # launch up to 3 concurrent processes
    config.keepalive = 1200                 # keep process alive for 20 mins without another call
    config.process_recycle.requests = 1000  # restart the process every 1000 requests (to avoid leaks)
    config.kill_timeout = 600               # kill the process if any call lasts longer than 10 minutes
Database-data fetching script example might look like this (in pseudocode):
def run(config, context, cache):
    expensive = cache["something_expensive"]
    for record in db_library_call(expensive, context.checkpoint, config.connection_string):
        context.log(record, "logDate")  # log all properties, optionally specify name of timestamp property
        last_date = record["logDate"]
    context.checkpoint = last_date      # persistent checkpoint, used next time through
def init(config, context, cache):
    cache["something_expensive"] = get_expensive_thing()

def shutdown(config, context, cache):
    expensive = cache["something_expensive"]
    expensive.release_me()
Is this API appropriately "pythonic", or are there things I should do to make this more natural to the Python scripter? (I'm more familiar with building C++/C#/Java APIs so I suspect I'm missing useful Python idioms.)
Specific questions:
is it natural to pass a "config" object into a method and ask the callee to set various configuration options? Or is there another preferred way to do this?
when a callee needs to stream data back to its caller, is a method like context.log() (see above) appropriate, or should I be using yield instead? (yield seems natural, but I worry it'd be over the heads of most scripters)
My approach requires scripts to define functions with predefined names (e.g. "run", "init", "shutdown"). Is this a good way to do it? If not, what other mechanism would be more natural?
I'm passing the same config, context, cache parameters into every method. Would it be better to use a single "context" parameter instead? Would it be better to use global variables instead?
Finally, are there existing libraries you'd recommend to make this kind of simple "script-running container" easier to write?
Have a look at SQLAlchemy for dealing with database stuff in Python. Also, to make script writing easier when dealing with concurrency, look into Stackless Python.