Managing Scala dependencies in Databricks notebooks - scala

I'm a new dev on a big Scala project where all the code is stored as notebooks and run inside Databricks clusters...
Each notebook defines classes and methods, and we have 'Main' notebooks with very few lines of code, which execute all the needed Scala notebooks (i.e. nearly all the notebooks in this project) in cells such as %run ./myPackage/Foo. Then these 'Main' notebooks have one small Scala code cell like this:
import com.bar.foo.Main
Main.main()
Furthermore, each notebook imports the packages it needs with Scala import statements such as import com.bar.foo.MyClass.
I find this really annoying:
If I move one notebook, I must update all the %run path/Notebook commands inside all my main notebooks and test notebooks.
It feels redundant to run the notebooks inside the main notebooks and also import the packages inside all the other notebooks.
Do you know another workflow? Is there a simpler way to work with multiple Scala notebooks inside Databricks?

I think these issues occur when users and companies treat notebooks as a replacement for software engineering principles. To address exactly these kinds of issues, the software world created and extensively uses design patterns, which are hard (if not impossible) to apply with notebooks. Therefore, I think users shouldn't treat notebooks as a tool for developing end-user solutions. The main role of notebooks has been prototyping and ML experimentation, so by definition they are not suitable for cases where modularity and scalability are important factors.
As for your case, and presuming that the use of notebooks is unavoidable, I would suggest minimizing their use and organising your code into JAR libraries instead. That is especially useful if the notebooks share a significant part of their code.
Let's consider, for instance, the case where notebooks N1 and N2 both use notebooks N3 and N4. You could place the implementation of N3 and N4 into a JAR, let's call it common_lib.jar, and then make common_lib.jar available to both N1 and N2 by attaching it to the cluster where they run (assuming that you run a notebook job); see the sketch after the list below. By following this approach you achieve:
Better modularity, since you completely separate the functionality of your notebooks. Also, for each job/notebook you can attach exactly the dependencies it needs to the cluster, avoiding the redundant dependencies that occur because it is difficult to split a notebook application into modules.
More maintainable code. Eventually you should have one final notebook per module that imports its dependencies as you would in an ordinary Scala application, avoiding the complex hierarchy required by calling multiple notebooks.
More scalable code. Notebooks provide a poor interface: dbutils.widgets.text(...) and dbutils.widgets.get(...) are definitely much less than what you can achieve with Scala/Java.
More testable code. You should know by now that with notebooks it is very hard to implement proper unit or integration testing. By having the main implementation in a JAR, you can run unit tests as you would with any Scala/Java application.
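As a minimal sketch of the idea (the names below, such as com.example.common and TextUtils, are hypothetical), the shared code becomes an ordinary Scala project that you build into common_lib.jar (e.g. with sbt package) and attach to the cluster:
// Shared code, compiled into common_lib.jar instead of living in notebooks N3/N4
package com.example.common

object TextUtils {
  // Example helper that both N1 and N2 need
  def normalise(s: String): String = s.trim.toLowerCase
}
Then, with common_lib.jar attached to the cluster, a cell in N1 or N2 simply imports it:
import com.example.common.TextUtils

val cleaned = TextUtils.normalise("  Some Raw Value  ")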
UPDATE
One solution for your case (when refactoring into JAR libraries is not possible) would be to organise the notebooks into modules, where each module uses an _includes_ notebook responsible for all the dependencies of that module. The _includes_ file could look like the next snippet:
%run "myproject/lib/notebook_1"
%run "myproject/lib/notebook_3"
...
Now let's assume that notebooks X1 and X2 share the same dependencies, myproject/lib/notebook_1 and myproject/lib/notebook_3. In order to use these dependencies, you just place the _includes_ file under the same folder and execute:
%run "_includes_"
in the first cell of the X1 and/or X2 notebook. In this manner you have a common way to include all the dependencies of your project, and you avoid having to copy/paste the includes repeatedly.
This doesn't provide an automated way to check and include the correct paths of the dependencies in your project, but it is still a significant improvement. By the way, I am not aware of an automated way to go through the files and change the imports dynamically. One option is to write an external custom script, although such a script shouldn't be invoked through your job.
Note: you must ensure that the hierarchy of the dependencies is well defined and that there are no circular dependencies.

Related

How to easily play around with the classes in a Scala/SBT project?

I'm new to Scala/SBT and I'm having trouble understanding how to just try out the classes and functions of a package to see what they're about, to get a feel for them. For example, take https://github.com/plokhotnyuk/rtree2d. What I want to do is something like this (in the top-level folder of the project):
# sbt
> console
> import com.github.plokhotnyuk.rtree2d.core._
...
etc. But this won't work, as it can't find the import even though it's in the project. I apologize for the vague question, though I hope from my hand-waving it's clear what I want to do. Another way to put it, maybe, is that I'm looking for something like the intuitive ease of use which I've come to take for granted in Python, using just bash and the interpreter. As a last resort I could create a separate project, import this package, and write a Main object, but this seems much too roundabout and cumbersome for what I want to do. I'd also like, if possible, to avoid IDEs, since I never really feel in control with them: they do all sorts of things behind the scenes, adding a lot of bulk and complexity.
rtree2d takes advantage of sbt's multi-module capabilities. A common use for this is to put the core functionality in one module and put less central aspects (e.g. higher-level APIs or integrations with other projects) in modules which depend on the core; all of these modules can be published independently and have their own dependencies.
This can be seen in the build.sbt file:
// The main project -- LR
lazy val rtree2d = project.in(file("."))
  .aggregate(`rtree2d-coreJVM`, `rtree2d-coreJS`, `rtree2d-benchmark`)
  // details omitted --LR

// Defines the basic skeleton for the core JVM and JS projects --LR
lazy val `rtree2d-core` = crossProject(JVMPlatform, JSPlatform)
  // details omitted

// Turns the skeleton into the core JVM project --LR
lazy val `rtree2d-coreJVM` = `rtree2d-core`.jvm

// Turns the skeleton into the core JS project --LR
lazy val `rtree2d-coreJS` = `rtree2d-core`.js

lazy val `rtree2d-benchmark` = project
In sbt, commands can be scoped to particular modules with module/command, so in the interactive sbt shell (from the top-level), you can do
> rtree2d-coreJVM/console
to run the console within the JVM core module. You could also run sbt 'rtree2d-coreJVM/console' directly from the shell in the top level, though this may require some care around shell quoting etc.
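Once the console is up, you can import the library and experiment with it interactively. As a rough sketch (the entry/RTree calls below are recalled from the rtree2d README and may differ between versions):
scala> import com.github.plokhotnyuk.rtree2d.core._
scala> import EuclideanPlane._
scala> val boxes = Seq(entry(1.0f, 1.0f, 2.0f, 2.0f, "a"), entry(2.0f, 2.0f, 3.0f, 3.0f, "b"))
scala> val tree = RTree(boxes)
scala> tree.searchAll(1.5f, 1.5f)  // entries whose boxes contain the point (1.5, 1.5)
Typing :quit (or Ctrl-D) drops you back into the sbt shell.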

More than one V4L-DVB driver on the same host machine

I have a question related to V4L-DVB drivers. Following the Building/Compiling the Latest V4L-DVB Source Code link, there are 3 ways to compile. I am curious about the last approach (the More "Manually Intensive" Approach). It allows me to choose the components that I wish to build and install using "make menuconfig". Some of these components (e.g. "CONFIG_MEDIA_ATTACH") are used in pre-processor directives that define a function one way if the symbol is defined and another way if it is not (e.g. dvb_attach, dvb_detach) in the resulting modules (e.g. dvb_core.ko) that will be loaded by most of the DVB drivers. What happens if there are two drivers (*.ko modules) on the same host machine, one that needs dvb_core.ko with CONFIG_MEDIA_ATTACH defined and another that needs dvb_core.ko with CONFIG_MEDIA_ATTACH undefined? Is there a clean way to handle this?
What is also not clear to me is this: since the V4L compilation environment seems very customizable (by setting the .config file), if I develop a driver using V4L-DVB structures, there is a big chance that it will conflict with other drivers, since each driver has its own custom settings. Is my understanding correct?
Thanks!
Dave

How do Atom's 'spec' files work?

I'm making a package for Atom, and Travis CI keeps telling me my build failed.
Update: I created a blank spec file and now my builds are passing.
You can see my package here: https://travis-ci.org/frayment/language-jazz
The console is telling me:
sh: line 105: ./spec: No such file or directory
Missing spec folder! Please consider adding a test suite in
I went looking around at Atom packages on GitHub for 'spec' files and they seem to be CoffeeScript based, but I can't understand what on earth they contain. There isn't much documentation on the subject, so:
What is a 'spec' file, and what do I put in it?
Help is very appreciated.
The ./spec directory should contain one or more Jasmine specifications for the Atom package you are developing. For example, this spec is taken from the Atom documentation:
describe "when a test is written", ->
it "has some expectations that should pass", ->
expect("apples").toEqual("apples")
expect("oranges").not.toEqual("apples")
One of the biggest challenges with open source software is maintaining quality when a large number of individual contributors are providing code; one solution to this is providing a high level of test coverage:
Like most aspects of programming, testing requires thoughtfulness. TDD is a very useful, but certainly not sufficient, tool to help you get good tests. If you are testing thoughtfully and well, I would expect a coverage percentage in the upper 80s or 90s. I would be suspicious of anything like 100% - it would smell of someone writing tests to make the coverage numbers happy, but not thinking about what they are doing.
In Atom's case, all of the specifications are added to the ./spec folder and must end with -spec.coffee, so for example if you were creating a package named awesome and your code sat within /awesome.coffee, then your spec would be ./spec/awesome-spec.coffee. Your spec should exercise the key areas of your code to give you confidence when merging pull requests into your master branch.
I have a couple of packages on Atom.io and both of these have tests included with them; you are welcome to use these as concrete examples of how Jasmine 1.3 tests can be written to support the functionality of your packages. Equally, the majority of packages on Atom.io also have a set of tests that you can draw upon to build your own test suite.

How can I perform dynamic reconfiguration in Scala (like Dyre or XMonad)?

A fairly common method of configuration for Haskell applications is to ship the program as a library, with a main function that accepts a bunch of optional parameters for configuration. When run, the executable itself looks for a dotfile containing a main function that uses this default function, which it then compiles and runs instead. This sort of configuration scheme allows the user to add arbitrarily complex functionality without recompiling the entire program. Examples of this are the Dyre library and the XMonad window manager. How can this be done cleanly in Scala? It appears that SBT does something similar internally.
Using SBT externally would require having the sources of the whole program somewhere, and lacks the cleanliness of just having a single dotfile. Typesafe Config, Configrity, Bee Config, and fig all seem to be meant only for ordinary string-based configuration.
https://github.com/typesafehub/config is a great config library.
It supports files in three formats: Java properties, JSON, and a human-friendly JSON superset (HOCON).
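For example, here is a minimal sketch of reading settings with it from Scala (application.conf is the library's default classpath resource; the myapp.* keys are hypothetical):
import com.typesafe.config.ConfigFactory

object AppSettings {
  // Loads application.conf (merged with reference.conf defaults) from the classpath
  private val config = ConfigFactory.load()

  // Hypothetical keys, e.g. defined in application.conf (HOCON) as:
  //   myapp {
  //     host = "localhost"
  //     port = 8080
  //   }
  val host: String = config.getString("myapp.host")
  val port: Int = config.getInt("myapp.port")
}
Note, though, that this covers static configuration; it doesn't recompile user-supplied Scala code the way Dyre or XMonad do.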

ESS workflow for R project/package development

Can anyone share their experience of a workflow for R project development under ESS? I have tried several times to learn Emacs but I haven't got it yet. I can understand ESS as an editor, but is there a project view in ESS? What are efficient ways to set up and view an R project directory, write code, and test it, and how does ESS have an edge in facilitating the whole process?
Do you use ESS as a good R editor only, or do you tend to emulate an R IDE environment within ESS?
Thanks for any advice.
It sounds like you're asking two separate questions.
One question concerns workflow and the other concerns using ESS.
As I use StatET and Eclipse, I'll just share my experience regarding the workflow aspect of your question.
As with Vincent I also follow something like the workflow set out by Josh Reich here (also see Hadley's useful comments):
Workflow for statistical analysis and report writing
Although it can vary between projects, I tend to have a couple of main R files:
import.R: this imports data files and does any necessary cleaning and manipulation
analyse.R: this generates the output that I need for any final report
main.R: this calls import.R and analyse.R
The aim is for import.R and analyse.R to represent the complete and final workflow for producing the final results of any analyses.
In terms of a directory structure for an analysis project, I'll often also have the following folders:
data: for storing any raw data files
meta: for storing meta data, such as variable labels, scoring systems for tests, recoding information, etc.
output: for storing any graphics, tables, or text generated by my analyses that I might want to incorporate into an external program
temp: When exploring the data and brainstorming analyses, I like to type code into files instead of using the console. I tend to label these temp1.R, temp2.R, temp3.R and store them in a temp folder. That way I have a permanent record that's easily accessible. If the analyses become final, they get incorporated into one of the main R files (i.e., import.R or analyse.R).
functions: If I think that a function will be needed across a couple of projects, I often place it (one function per file, or a set of related functions in a file) in a folder called functions. This makes it relatively easy to reuse functions across projects when the formal requirements of package development are more than needed.
library: If I want to create some general functions that I think will be project specific, I'll place them in this folder
save: A folder to store any saved R objects
StatET and Eclipse make it easy to interact with such a file system.
Of course, given all the R gurus that use ESS and Emacs, I'm sure it also handles interactions with the file system well.
I'm not exactly sure what you expect as an answer on this one. I, for one, have stolen (and adapted) a system that was suggested here a little while ago (by Josh Reich):
Create a folder for every project, and split up your work in a bunch of different .R files:
Load.R for getting your raw data into R;
Prep.R for cleaning the data, recoding variables, etc.;
Func.R for coding any custom functions you will need for evaluation; and
Eval.R for running your final stuff.
If that doesn't fit your style, just change it.
Then, you can either have a master file to call each of the parts one after the other (good for reproducibility), or save at different stages and have the individual scripts load the appropriate data (good if some of the prep work is very computationally or time intensive).
On a different note, the trick that is posted at the link really helped me get into ESS. It turns Shift-Enter into a one-stop-ESS-shop: http://www.kieranhealy.org/blog/archives/2009/10/12/make-shift-enter-do-a-lot-in-ess/
Others have given you some good ideas about how to setup your directory/file structure for a project.
You also asked about "project views," in which case you might want to look into the Emacs Code Browser (ECB).
You can find some screen shots of it in action on its site, here:
http://ecb.sourceforge.net/screenshots/index.html