ipyparallel: what is the most reliable way to clear/reset engine namespaces without restarting the cluster? - ipython

As far as I understand, an IPython cluster manages a set of persistent namespaces (one per engine). As a result, if a module imported by an engine engine_i is modified, killing the main interpreter is not sufficient for that change to be reflected in engine_i's namespace.
Here's a toy example that illustrates this:
#main.py
from ipyparallel import Client
from TC import test_class  # TC is defined in the next code block

if __name__ == "__main__":
    cl = Client()
    cl[:].execute("import TC")
    lv = cl.load_balanced_view()
    lv.block = True
    tc = test_class()
    res = lv.map(tc, [12, 45])
    print(res)
with the TC module only consisting of
#TC.py
class test_class:
    def __call__(self, y):
        return -1
Here, consider the following execution:
$ipcluster start -n <any_number_of_engines> --daemonize
$python3 main.py
[-1, -1]
$#open some editor and modify test_class.__call__ so that it returns -2 instead of -1
$python3 main.py #output is unchanged: still [-1, -1] instead of [-2, -2]
[-1, -1]
This is expected, as the engines have their own persistent namespaces, and a trivial solution to make sure that changes to TC reach the engines is simply to kill them (e.g. via $ipcluster stop) and restart them before running the script.
However, killing/restarting engines quickly becomes tedious if you need to modify a module frequently. So far, I've found a few potential solutions, but none of them are really useful:
If the modification is made to a module directly imported into the engine's namespace, like TC above:
cl[:].execute("from imp import reload; import TC; reload(TC)")
However, this is very limited as it is not recursive (e.g. if TC.test_class.__call__ itself imports another_module and we modify another_module, then this solution won't work).
Because of the problem with the previous solution, I tried ipython's deepreload in combination with %autoreload:
from IPython import get_ipython
ipython=get_ipython()
ipython.magic("%reload_ext autoreload")
ipython.magic("%autoreload 2")
cl[:].execute("import builtins;from IPython.lib import deepreload;builtins.reload=deepreload.reload;import TC;reload(TC)")
This doesn't seem to work at all, for reasons that I haven't understood so far.
The %reset magic from IPython is supposed to clear the namespace (per the documentation), but it didn't work on the engine namespaces, including in the toy example given above.
I tried to adapt the first answer given here to clean up the engine namespaces. However, it doesn't seem to help with re-importing modified modules.
It seems to me that the most reliable solution is therefore to just kill/restart the engines each time. It looks like this can't even be done from the script, as cl.shutdown(restart=True) throws NotImplementedError. Is everyone working with ipyparallel constantly restarting their clusters manually, or is there something obvious that I'm missing?

To clear the namespaces of the engines, ipyparallel's Client objects (as well as DirectView and BroadcastView objects) have a clear() method (documentation) that does exactly that.
For instance:
>>> from ipyparallel import Client
>>> client = Client()
>>> dview = client[:]
>>> dview.block = True
>>> dview.execute('import TC')
<AsyncResult: execute:finished>
>>> dview.apply(dir)
[['In', 'Out', 'TC', '_6f3c4b7b7576b8f6a12531042d4da9e4_5_args', '_6f3c4b7b7576b8f6a12531042d4da9e4_5_f', '_6f3c4b7b7576b8f6a12531042d4da9e4_5_kwargs', '_6f3c4b7b7576b8f6a12531042d4da9e4_5_result', '__builtin__', '__builtins__',
...
>>> client.clear(client.ids)
<Future at 0x2576553edf0 state=pending>
# The TC module is gone. What remains are built-in symbols, as well as some variables created when using apply()
>>> dview.apply(dir)
[['In', 'Out', '_6f3c4b7b7576b8f6a12531042d4da9e4_13_args', '_6f3c4b7b7576b8f6a12531042d4da9e4_13_f', '_6f3c4b7b7576b8f6a12531042d4da9e4_13_kwargs', '_6f3c4b7b7576b8f6a12531042d4da9e4_13_result', '__builtin__', '__builtins__', '__name__', '
...
However, this method doesn't help with reloading a module on an engine, which is closer to what you're actually trying to do, because Python caches loaded modules in sys.modules.
There doesn't seem to be one way to reload modules that always works; in addition to the question you linked, this question, this question and this question give some solutions for different situations.
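As a partial workaround, here is a minimal sketch (assuming the TC module and running ipcluster from the question) that combines clear() with dropping the cached module from sys.modules on each engine, so that the next import re-executes TC.py. Modules imported by TC itself are only refreshed if you pop them as well, so this is not a fully recursive reload:
from ipyparallel import Client

client = Client()
dview = client[:]
dview.block = True

# Wipe the interactive namespaces on all engines (as shown above).
client.clear(client.ids)

# Drop the cached module on every engine and re-import it, so that TC.py
# is executed again. Any module that TC itself imports has to be popped
# from sys.modules here as well, otherwise it stays stale.
dview.execute(
    "import sys\n"
    "sys.modules.pop('TC', None)\n"
    "import TC\n"
)
After this, a fresh load-balanced map over test_class should pick up the edited code, although for deep import chains restarting the engines remains the only approach that is guaranteed to work.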

Related

dynamic import with interpolated string

I'm trying out parcel in a hobby project, having worked with create-react-app (i.e. webpack) before. I have had a great experience with dynamic imports of the following sort:
const Page = React.lazy(() => import(`./${page}`));
This is in a wrapper component that takes care of the suspense etc. and gets page as a prop (always a literal string, not a variable/expression; not sure if that makes a difference).
With webpack this works wonderfully, even though I'm not sure how. Each such page I hit in the app gets loaded the first time, then it's available instantly. I understand this is quite hard for the bundler to figure out, but yeah, it works.
When I try the same with parcel, it still builds but fails at runtime. If I dynamically import e.g. './SomePage', that is exactly what is requested from the server (GET /SomePage), which of course serves index.html. This happens both on the dev server and with a build. The build also only produces one .js file, so it doesn't split at all.
Is it even possible with parcel to import like this? Am I missing some configuration (I don't have any at the moment)?
I think there is an unfixed bug in webpack where dynamic imports with interpolation have issues because the bundler is not able to pinpoint the exact file path.
Source
Webpack collects all the dependencies, then checks the import statement and parses its argument into a regex, like this:
import('./app'+path+'/util') => /^\.\/app.*\/util$/
Webpack then looks for modules that match this regex. So if the argument is a plain variable, webpack ends up with the wrong regex and, in effect, looks for modules globally.
Prepending an empty string to the interpolated path might help here:
const Page = React.lazy(() => import("" + `./${page}`));

Managing Scala dependencies in Databricks notebooks

I'm a new dev on a big Scala project where all the code is stored as notebooks and run inside Databricks clusters...
Each notebook defines classes and methods, and we have 'Main' notebooks which have very few lines of code but execute all the needed Scala notebooks (i.e. nearly all the notebooks in this project) in cells such as %run ./myPackage/Foo. Then these 'Main' notebooks have one little Scala code cell like this:
import com.bar.foo.Main
Main.main()
Furthermore, each notebook imports the packages it needs via Scala import statements such as import com.bar.foo.MyClass.
I find this really annoying:
If I move one notebook, I must update all the %run path/Notebook commands inside all my main/test notebooks.
It feels redundant to run the notebooks inside the main notebooks and to import the packages inside all the other notebooks.
Do you know another workflow? Is there a simpler way to work with multiple Scala notebooks inside Databricks?
I think that these issues occur when users and companies treat notebooks as a replacement for software engineering principles. To address exactly these issues, the software world created, and makes extensive use of, design patterns, which are hard (if not impossible) to apply with notebooks. Therefore, I think that users shouldn't handle notebooks as a tool for developing their end-user solutions. The main role of notebooks has been prototyping and ML experimentation, so by definition they are not suitable for cases where modularity and scalability are important factors.
As for your case, and presuming that the usage of notebooks is unavoidable, I would suggest minimizing their use and starting to organise your code into JAR libraries. This is especially useful if the notebooks share a significant part of their code.
Let's consider, for instance, the case where notebooks N1 and N2 both use notebooks N3 and N4. You could place the implementation of N3 and N4 into a JAR, let's call it common_lib.jar, and then make common_lib.jar available to both N1 and N2 by attaching it to the cluster where they run (assuming that you run a notebook job). By following this approach you achieve:
Better modularity, since you completely separate the functionality of your notebooks. Also, for each job/notebook you can attach the exact dependencies to the cluster, avoiding the redundant dependencies that occur because it is difficult to split a notebook application into modules.
More maintainable code. Eventually you should have one final notebook per module that imports its dependencies as you would in an ordinary Scala application, avoiding the complex hierarchy required by calling multiple notebooks.
More scalable code. Notebooks provide only a poor interface (dbutils.widgets.text(...) and dbutils.widgets.get(...)), which is definitely much less than what you can achieve with Scala/Java.
More testable code. You should know by now that with notebooks it is very hard to implement proper unit or integration testing. With the main implementation in a JAR, you can run unit tests as you would with any Scala/Java application.
UPDATE
One solution for your case (if refactoring to JAR libraries is not possible) would be to organise the notebooks into modules, where each module uses an _includes_ file responsible for all of that module's dependencies. The _includes_ file could look like the following snippet:
%run "myproject/lib/notebook_1"
%run "myproject/lib/notebook_3"
...
Now let's assume that notebooks X1 and X2 share the same dependencies, myproject/lib/notebook_1 and myproject/lib/notebook_3. To use those dependencies, you just place the _includes_ file in the same folder and execute:
%run "_includes_"
in the first cell of the X1 and/or X2 notebook. In this manner you have a common way to include all the dependencies of your project and avoid having to copy/paste the includes repeatedly.
This doesn't give you an automated way to check and include the correct dependency paths in your project, although it could still be a significant improvement. I am not aware of an automated way to go through the files and change the imports dynamically; one option would be to write an external custom script, although such a script shouldn't be invoked from your job.
Note: you must ensure that the dependency hierarchy is well defined and that you don't have any circular dependencies.

Pycharm: Not Finding Pika Library (in path)

I spent 4 hours on something simple, trying to figure out why PyCharm did not find my pika library when running from inside the development environment. The answer became obvious once found, but for all of you who are suffering from this simple issue, try this:
Pycharm -> Run -> Configurations (the Run/Debug Configurations dialog)
Uncheck:
Add content roots to PYTHONPATH
Add source roots to PYTHONPATH
Those settings should not be the reason the library isn't found on your path.
It's possible that you have files in your project which mirror the names of the library or otherwise interfere with resolution of the import name. You really should try to fix this issue right here, or you may find yourself having to debug even stranger problems after you send the code along to someone else.
Let's say that you're trying to run:
>>> import foo
This will look for foo.py, or a folder named foo containing __init__.py, in your PYTHONPATH.
If your own code also contains a foo.py (or a folder named foo containing __init__.py), Python will import your own module instead of the site package you're actually trying to import.
This may seemingly work without error, but if you were instead to do:
>>> from foo import fooclass
This class does not exist in your library, and therefore you're going to get an ImportError.
Similarly, if you did:
>>> import foo
>>> c = foo.fooclass()
You should get an AttributeError.
Adding your source roots to PYTHONPATH is a fairly common requirement, and something you may need if your project grows beyond a few files. Not being able to do that can result in some really laborious workarounds in the future.
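As a quick diagnostic sketch (using the hypothetical module name foo from the example above; substitute the real library name, e.g. pika), you can check which file Python actually imported and which directories it searches first:
import sys
import foo

# If this prints a path inside your own project rather than site-packages,
# a local file or package is shadowing the installed library.
print(foo.__file__)

# The first entries of sys.path are searched first; PyCharm's
# "content roots" / "source roots" options add project directories here.
print(sys.path[:5])
If foo.__file__ points into your project, renaming the offending local file or package is usually the real fix.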

How to load external in Pure Data 0.46-7 (Mac) correctly?

I'm having some trouble trying to load zexy and iemlib into Pd Vanilla 0.46-7. I had no problems compiling and installing cyclone from https://github.com/electrickery/pd-cyclone. It works fine. So I tried installing iemlib and zexy from https://github.com/iem-projects/pd-iem using their binaries but there's something wrong going on. When I turn on "verbose" under path preferences, PD seems to be looking for a file with the same name as the object I'm trying to use. Using [zexy/multiplex] in a patch gives:
tried ~/Library/Pd/zexy/multiplex.d_fat and failed
tried ~/Library/Pd/zexy/multiplex.pd_darwin and failed
tried ~/Library/Pd/zexy/multiplex/multiplex.d_fat and failed
But there's no multiplex.d_fat only zexy.d_fat. Same with iemlib, there's no dollarg.d_fat or dollarg.pd_darwin only iem_mp3.d_fat, iem_t3_lib.d_fat, iemlib1.d_fat, and iemlib2.d_fat. I'm guessing these files are where the externals were compiled in.
I tried using deken and iemlib installs the .pd_darwin files but I guess this is an older version(?) and zexy is still installing zexy.d_fat so I can't load its objects.
I also tried loading the lib "zexy/zexy" under startup preferences and it loads ok but then I get messages like:
warning: class 'abs~' overwritten; old one renamed 'abs~_aliased'
and I seem to lose namespace functionality: I can no longer refer to [zexy/multiplex] and need to use only [multiplex], which I guess is the correct behaviour.
How does Pd know how to look for objects on files with different names?
Any advice?
This thread is marked as solved http://forum.pdpatchrepo.info/topic/9677/having-trouble-with-deken-plugin-and-zexy-library-solved and sounds like a similar problem but I haven't been successful.
zexy is built as a multi-object library, so there is no separate binary for zexy/multiplex.
As you have correctly guessed, the correct way to load zexy is as a whole (either using [declare -lib zexy] in your patch or adding zexy to the startup libs; no need to use zexy/zexy), and to ignore the warning about abs~.
As for how loading works:
Pd maintains a list of objects it knows how to create. E.g. whenever you create [pack], Pd will look up pack in its list of known objects and use the information found there to actually create the object.
If you try to create an object that Pd doesn't know about yet (e.g. [foo]), then Pd will look for a library named foo (e.g. foo.pd_linux) and, if found, will "load" it.
Loading a library means calling a special function in the library (this special function is the entry point of the library and is called foo_setup() in our case).
After that, Pd will check whether it now has foo in the list of known objects. If it does, it will create the object.
Now the magic is done in the special function that is called when Pd loads the library: this function's main purpose is to tell Pd about new objects (basically saying: "if somebody asks for object 'foo', I can make one for you").
When zexy's special function is called, it tells Pd about all zexy objects (including multiplex), so after Pd has loaded zexy, it knows how to create the [multiplex] object.
If the special function registers an object that Pd already knows about (e.g. in the case of zexy it tries to register a new object abs~ even though Pd already has a built-in object of the same name), then Pd will rename the original object by appending _aliased and the newly registered object will take over the name.

Disabling nagle in python: how to do it the right way?

I need to disable the Nagle algorithm in Python 2.6.
I found out that patching HTTPConnection in httplib.py this way
def connect(self):
    """Connect to the host and port specified in __init__."""
    self.sock = socket.create_connection((self.host, self.port),
                                         self.timeout)
    self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, True)  # added line
does the trick.
Obviously, I would like to avoid patching a system library if possible. So, the question is: what is the right way to do such a thing? (I'm pretty new to Python and may easily be missing some obvious solution here.)
Please note that if using the socket library directly, the following is sufficient:
self.socket.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, True)
I append this information to the accepted answer because it satisfies the information need that brought me here.
It's not possible to change the socket options that httplib specifies, and it's not possible to pass in your own socket object either. In my opinion this sort of lack of flexibility is the biggest weakness of most of the Python HTTP libraries. For example, prior to Python 2.6 it wasn't even possible to specify a timeout for the connection (except by using socket.setdefaulttimeout() globally, which wasn't very clean).
If you don't mind external dependencies, it looks like httplib2 already has TCP_NODELAY specified.
You could monkey-patch the library. Because python is a dynamic language and more or less everything is done as a namespace lookup at runtime, you can simply replace the appropriate method on the relevant class:
import httplib

def patch_httplib():
    orig_connect = httplib.HTTPConnection.connect
    def my_connect(self):
        orig_connect(self)
        self.sock.setsockopt(...)
    # replace the original method with the patched one
    httplib.HTTPConnection.connect = my_connect
However, this is extremely error-prone as it means that your code becomes quite specific to a particular Python version, as these library functions and classes do change. For example, in 2.7 there's a _tunnel() method called which uses the socket, so you'd want to hook in the middle of the connect() method - monkey-patching makes that extremely tricky.
In short, I don't think there's an easy answer, I'm afraid.
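If hooking the library feels too fragile, one alternative (a minimal sketch, not an official httplib feature) is to subclass HTTPConnection and set TCP_NODELAY right after the parent class has created the socket; you then construct the subclass wherever you would have used httplib.HTTPConnection:
import socket
import httplib  # Python 2.x, as in the question


class NoDelayHTTPConnection(httplib.HTTPConnection):
    """HTTPConnection that disables Nagle's algorithm on its socket."""

    def connect(self):
        # Let the standard library establish the connection first.
        httplib.HTTPConnection.connect(self)
        # Then turn off Nagle on the freshly created socket.
        self.sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, True)


# Usage: build the subclass directly instead of httplib.HTTPConnection.
conn = NoDelayHTTPConnection('example.com', timeout=10)
conn.request('GET', '/')
print(conn.getresponse().status)
This keeps the change local to your own code, though code paths that build HTTPConnection objects internally (urllib2, for instance) would still need to be pointed at the subclass via a custom handler.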