Hosting simple Python scripts in a container to handle concurrency, configuration, caching, etc. - plugins

My first real-world Python project is to write a simple framework (or re-use/adapt an existing one) which can wrap small Python scripts (used to gather custom data for a monitoring tool) in a "container" that handles boilerplate tasks like:
fetching a script's configuration from a file, keeping that information up to date when the file changes, and handling decryption of sensitive config data
running multiple instances of the same script in different threads instead of spinning up a new process for each one
exposing an API for caching expensive data and storing persistent state from one script invocation to the next
Today, script authors must handle these issues themselves, which in practice means most don't handle them correctly, causing bugs and performance problems. Beyond avoiding bugs, we want a solution that lowers the bar for creating and maintaining scripts, especially since many script authors are not trained programmers.
Below are examples of the API I've been thinking of, and which I'm looking to get your feedback about.
A scripter would need to write a single method which takes (as input) the configuration the script needs to do its job, and either returns a Python object or calls a method to stream data back in chunks. Optionally, a scripter could supply methods to handle startup and/or shutdown tasks.
HTTP-fetching script example (in pseudocode, omitting the actual data-fetching details to focus on the container's API):
def run(config, context, cache):
    results = http_library_call(config.url, config.http_method, config.username, config.password, ...)
    return {"html": results.html, "status_code": results.status, "headers": results.response_headers}

def init(config, context, cache):
    config.max_threads = 20                  # up to 20 URLs at one time (per process)
    config.max_processes = 3                 # launch up to 3 concurrent processes
    config.keepalive = 1200                  # keep the process alive for 20 minutes without another call
    config.process_recycle.requests = 1000   # restart the process every 1000 requests (to avoid leaks)
    config.kill_timeout = 600                # kill the process if any call lasts longer than 10 minutes
A database-fetching script example might look like this (in pseudocode):
def run(config, context, cache):
    expensive = cache["something_expensive"]
    for record in db_library_call(expensive, context.checkpoint, config.connection_string):
        context.log(record, "logDate")       # log all properties; optionally name the timestamp property
        last_date = record["logDate"]
        context.checkpoint = last_date       # persistent checkpoint, used next time through

def init(config, context, cache):
    cache["something_expensive"] = get_expensive_thing()

def shutdown(config, context, cache):
    expensive = cache["something_expensive"]
    expensive.release_me()
Is this API appropriately "pythonic", or are there things I should do to make this more natural to the Python scripter? (I'm more familiar with building C++/C#/Java APIs so I suspect I'm missing useful Python idioms.)
Specific questions:
Is it natural to pass a "config" object into a method and ask the callee to set various configuration options? Or is there another preferred way to do this?
When a callee needs to stream data back to its caller, is a method like context.log() (see above) appropriate, or should I be using yield instead? (yield seems natural, but I worry it'd be over the head of most scripters; a rough generator-based sketch follows these questions)
My approach requires scripts to define functions with predefined names (e.g. "run", "init", "shutdown"). Is this a good way to do it? If not, what other mechanism would be more natural?
I'm passing the same config, context, cache parameters into every method. Would it be better to use a single "context" parameter instead? Would it be better to use global variables instead?
Finally, are there existing libraries you'd recommend to make this kind of simple "script-running container" easier to write?
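To make the yield question concrete, here is a rough sketch of what a generator-based version of the database script might look like (db_library_call and the field names are just the placeholders from the pseudocode above; whether the container should support this is exactly what I'm asking):

def run(config, context, cache):
    expensive = cache["something_expensive"]
    # Yield each record instead of pushing it through context.log();
    # the container would iterate the generator and handle logging itself.
    for record in db_library_call(expensive, context.checkpoint, config.connection_string):
        yield record
        context.checkpoint = record["logDate"]  # still a persistent checkpoint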

Have a look at SQLAlchemy for dealing with database access in Python. Also, to make script writing easier where concurrency is concerned, look into Stackless Python.
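For example, a minimal SQLAlchemy sketch (the connection string, table, and columns are placeholders, not anything from your monitoring tool):

from sqlalchemy import create_engine, text

# placeholder connection string; in your container this would come from config
engine = create_engine("postgresql://user:password@host/dbname")

with engine.connect() as conn:
    # stream rows back, much like the db_library_call pseudocode above
    for row in conn.execute(text("SELECT id, log_date FROM events")):
        print(row)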

Related

Is it possible to have a transaction-wide global variable in PostgreSQL?

The situation is this: I have a function F1 which writes into a buffer, and the buffer is written to external files when F1's fcinfo->flinfo->fn_mcxt is released. I also have a function F2 which depends on those external files, so when it runs I want to make sure that every existing F1 buffer (in this transaction) has already been written out to the external files. The two functions are independent, except when they are used together.
As a result, I want that buffer to be a global variable within the transaction, so F2 can check whether it is empty; if it is not, F2 can write it out manually.
Since PostgreSQL uses multiprocessing, one backend cannot see the global variables in another backend. You could write a _PG_init function that creates a shared memory segment for that purpose (see pg_stat_statements). That requires that your library is added to shared_preload_libraries.
A simpler alternative might be to use the LISTEN / NOTIFY facility of PostgreSQL to synchronize different backends.
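As a rough illustration of the LISTEN / NOTIFY idea (the channel name is made up, and how F1 and F2 map onto sessions depends on your setup):

-- in the backend that needs to know the buffer was flushed
LISTEN buffer_flushed;

-- in the backend (or function) that flushes the buffer to the external files
NOTIFY buffer_flushed, 'flushed';
-- or, from inside a function:
SELECT pg_notify('buffer_flushed', 'flushed');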

Perl script running a periodic (main) task and providing a REST interface

I am working on a Perl script which does some periodic processing based on file-system contents.
The overall structure is like this:
# ... initialization ...
while (1) {
    # ... scan filesystem, perform actions depending on changes detected ...
    sleep 5;
}
I would like to add the ability to feed data into this process by exposing an HTTP interface. For example, I would like an endpoint that skips the sleep, and also some means to submit data that is processed in the next iteration. Additionally, I would like to be able to query some of the program's status through HTTP, so a simple fork() that runs the webserver part in a separate process is insufficient.
So far I have used the Dancer2 framework once, but its start; call blocks and thus does not allow any other tasks (like my loop) to run. I could of course move the code currently inside the loop into an endpoint exposed through Dancer2, but then I would need to invoke it periodically (through an external program?), which seems an obscure indirection compared to just having the webserver part run in the background.
Is it possible to unobtrusively (i.e. without blocking the program) add a REST-server capability to a Perl script? If yes: Which modules would be used for the purpose? If no: Should I really implement an external process to periodically invoke a certain endpoint or pursue a different solution altogether?
(I have tried to add a dancer2 tag, but could not do so due to insufficient reputation. Do not be misled by this: I have so far only tried Dancer2, not Dancer (v1).)
You could launch your processing loop in a background thread before you call start;.
See man perlthrtut
You will probably want use threads::shared; to declare some variables shared between the REST part and the background thread, or use dedicated queues/event mechanisms.
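A rough sketch of that structure, assuming a threads-enabled Perl and Dancer2 (the route, the param name, and the @pending variable are placeholders):

use strict;
use warnings;
use threads;
use threads::shared;
use Dancer2;

my @pending :shared;    # data posted via REST, drained by the background loop

# run the periodic scan in a background thread so start; can block below
threads->create(sub {
    while (1) {
        # ... scan filesystem, also process anything queued in @pending ...
        { lock(@pending); @pending = (); }
        sleep 5;
    }
})->detach;

post '/data' => sub {
    lock(@pending);
    push @pending, scalar param('item');   # queue input for the next iteration
    return "queued\n";
};

start;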

Non-RESTful backend with backbone.js

I'm evaluating backbone.js as a potential JavaScript library for use in an application which will have a few different backends: WebSocket, REST, and a 3rd-party library producing JSON. I've read some opinions that backbone.js works beautifully with RESTful backends so long as the API is "by the book" and uses the appropriate HTTP verbs. Can someone elaborate on what this means?
Also, how much trouble is it to get backbone.js to connect to WebSockets? Lastly, are there any issues with integrating a backbone.js model with a function which returns JSON - in other words does the data model always need to be served via REST?
Backbone's power is that it has an incredibly flexible and modular structure. It means that you can use, extend, take out, or modify any part of Backbone. This includes the AJAX functionality.
Backbone doesn't "care" where you get the data for your collections or models. It will help you out by providing an out-of-the-box RESTful "ajax" solution, but it won't be mad if you want to use something else!
This allows you to find (or write) any plugin you want to handle the server interaction. Just look on backplug.io, Google, and Github.
Specifically for Sockets there is backbone.iobind.
Can't find a plugin? No worries. I can tell you exactly how to write one (it's 100x easier than it sounds).
The first thing that you need to understand is that overwriting behavior is SUPER easy. There are 2 main ways:
Globally:
Backbone.Collection.prototype.sync = function() {
    //screw you Backbone!!! You're completely useless, I am doing my own thing
};
Per instance:
var MySpecialCollection = Backbone.Collection.extend({
    sync: function() {
        //I like what you're doing with the ajax thing... Clever clever ;)
        // But for a few collections I wanna do it my way. That cool?
    }
});
And the only other thing you need to know is what happens when you call "fetch" on a collection. This is the "by the book"/"out of the box" behavior:
collection#fetch is triggered by the user (YOU). fetch will delegate the ACTUAL fetching (ajax, sockets, local storage, or even a function that instantly returns JSON) to some other function (collection#sync). Whatever function is in collection.sync has to take 3 arguments:
action: create (for creating), read (for fetching), update (for updating), or delete (for deleting) = CRUD.
context (the this variable) - if you don't know what this does, don't worry about it; it's not important for now
options - where da magic is. We only care about 1 option though
success: a callback that gets called when the data is "ready". THIS is the callback that collection#fetch is interested in, because that's when it takes over and does its thing. The only requirement is that sync passes it the following as its 1st argument:
response: the actual data it got back
Now, whatever you put in collection#sync has to call the success callback from its options when it's done getting the data. Once collection#sync is done doing its thing, collection#fetch takes back over (via that success callback) and does the following nifty steps:
Calls set or reset (for these purposes they're roughly the same).
When set finishes, it triggers a sync event on the collection broadcasting to the world "yo I'm ready!!"
So what happens in set? Well, a bunch of stuff (deduping, parsing, sorting, removing, creating models, propagating changes and general maintenance). Don't worry about it. It works ;) What you need to worry about is how you can hook into different parts of this process. The only two you should worry about (if your server wraps data in weird ways) are:
collection#parse, for parsing a collection. It should accept the raw JSON (or whatever format) that comes from the server/ajax/websocket/function/worker/who-knows-what and turn it into an ARRAY of objects. It takes resp (the JSON) as its 1st argument and should return a mutated response. Easy peasy.
model#parse. Same as collection, but it takes in the raw objects (i.e. imagine you iterate over the output of collection#parse) and spits out an "unwrapped" object.
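For instance, a bare-bones sync for the "function that returns JSON" case might look something like this (getLocalJson and the items wrapper are placeholders, not real Backbone APIs):

var MyCollection = Backbone.Collection.extend({
    sync: function (method, collection, options) {
        if (method === "read") {
            var data = getLocalJson();   // placeholder for whatever hands you JSON
            options.success(data);       // fetch's wrapped callback does set/reset and fires "sync"
        }
    },
    parse: function (resp) {
        return resp.items || resp;       // unwrap if the payload nests the array
    }
});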
Get off your computer and go to the beach because you finished your work in 1/100th the time you thought it would take.
That's all you need to know in order to implement whatever server system you want in place of the vanilla "ajax requests".

Is there a way to copy files in a non-blocking way in Scala?

I have checked java.nio.file.Files.copy but that blocks a thread until the copy is done. Are there any libraries that allow one to copy a file in a non-blocking way? I need to perform many of these operations simultaneously and cannot afford to have so many threads blocked.
While I could write something myself using non-blocking streams, I would rather use something tried and tested that would guarantee a correct copy every time (or detect if something went wrong).
Check this: Iterate over lines in a file in parallel (Scala)?
import scala.io.Source

val chunkSize = 128 * 1024
val iterator = Source.fromFile(path).getLines.grouped(chunkSize)
iterator.foreach { lines =>
  lines.par.foreach { line => process(line) }
}
This reads (copies) the file in chunks, processing the chunks in parallel via "par".
So it is reasonably non-blocking in terms of the processors (cores).
But you can follow the same chunking idea with, for example, Akka/Futures/Promises to go even wider in scope.
You can tune your chunk size depending on your performance characteristics, system load, etc.
One more link explains a possible way to read/write data from a (property) file in parallel using Akka Actors. This is not quite what you want, but it may give you an idea.
The idea: you can build your own non-blocking way of reading/copying files.
--
And about your statement "While I could write something myself using non-blocking streams":
I would point out that each OS / file system (FS) may have its own view of what and where to block. For example, Windows locks a file (a write lock, at least) while one thread writes to it; on Linux this is configurable. So if you want to stick to something stable, I would suggest thinking it through and going with your own wrapper (over the FS) based on events, chunks, and states.
I have used the Process class, issuing an operating system command to copy the file. Of course, one has to check under which OS the application is running, and issue the appropriate command, but this allows for fast and asynchronous copies.
As Marius rightly mentions in the comments, Scala Process blocks, so I run it wrapped in a Future.
Java 8's Process introduces an isAlive() function. A non-blocking alternative would be to use Java 8 processes and a scheduler to poll at regular intervals to see if the process has finished. However, I did not need to go to that extent.
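For illustration, here is the same "wrap the blocking call in a Future" idea applied directly to java.nio's Files.copy, with a dedicated thread pool so the default ExecutionContext stays free (the object name and pool size are arbitrary):

import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

object AsyncCopy {
  // dedicated pool for the blocking file I/O; size chosen arbitrarily
  private val blockingEc =
    ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(4))

  def copy(src: Path, dst: Path): Future[Path] =
    Future(Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING))(blockingEc)
}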
Have you checked out the async stuff in scala-io?
http://jesseeichar.github.io/scala-io-doc/0.4.2/index.html#!/core/async%20read%20write

Form-related problems

I am new to Lift and I am considering whether I should investigate it more closely and start using it as my main platform for web development. However, I have a few "fears" which I would be happy to have dispelled first.
Security
Assume that I have the following snippet that generates a form. There are several fields and the user is allowed to edit just some of them.
def form(in: NodeSeq): NodeSeq = {
  val data = Data.get(...)
  <lift:children>
    Element 1: { textIf(data.el1, data.el1(_), isEditable("el1")) }<br />
    Element 2: { textIf(data.el2, data.el2(_), isEditable("el2")) }<br />
    Element 3: { textIf(data.el3, data.el3(_), isEditable("el3")) }<br />
    { button("Save", () => data.save) }
  </lift:children>
}

def textIf(label: String, handler: String => Any, editable: Boolean): NodeSeq =
  if (editable) text(label, handler) else Text(label)
Am I right that there is no vulnerability that would allow a user to change a value of some field even though the isEditable method assigned to that field evaluates to false?
Performance
What is the best approach to form processing in Lift? I really like the way of defining anonymous functions as handlers for every field - but how does it scale? I guess that for every handler a function is added to the session along with its closure, and it stays there until the form is posted back. Doesn't that introduce a potential performance issue for a service under high load (let's say 200 requests per second)? And when do these handlers get freed (if the form isn't resubmitted and the user either closes the browser or navigates to another page)?
Thank you!
With regards to security, you are correct. When an input is created, a handler function is generated and stored server-side using a GUID identifier. The function is session specific, and closed over by your code - so it is not accessible by other users and would be hard to replay. In the case of your example, since no input is ever displayed - no function is ever generated, and therefore it would not be possible to change the value if isEditable is false.
As for performance, on a single machine Lift performs incredibly well. It does, however, require session-aware load balancing to scale horizontally, since the handler functions do not easily serialize across machines. One thing to remember is that Lift is incredibly flexible, and you can also create stateless form processing if you need to (albeit it will not be as secure). I have never seen too much of a memory hit with the applications we have created and deployed. I don't have many hard stats available, but in this thread David Pollak mentioned that demo.liftweb.net at the time had 214 open sessions consuming about 100 MB of RAM (~500 KB/session).
Also, here is a link to the Lift book's chapter on Scalability, which also has some more info on security.
The closures and all the associated state are certainly cleaned up at session shutdown. Earlier than that -- I don't know. Anyway, it's not really a theoretical question -- it highly depends on how users use web forms in practice. So, for a broader answer, I'd ask the question on the main Liftweb channel -- https://groups.google.com/forum/#!forum/liftweb
Also, you can use a "static" (stateless) form if you want to. But AFAIK there are no problems with memory, and everybody uses the main approach to forms.
If you don't create the handler XML/HTML, the user won't be able to change the data, that's for sure. In your code, if I understood it correctly (I'm not sure), you don't create text(label, handler) when it's not needed, so everything is secure.