How to re-use compiled sources in different machines - scala

To speed up our development workflow we split the tests and run each part on multiple agents in parallel. However, compiling test sources seem to take most of the time for the testing steps.
To avoid this, we pre-compile the tests using sbt test:compile and build a docker image with compiled targets.
Later, this image is used in each agent to run the tests. However, it seems to recompile the tests and application sources even though the compiled classes exists.
Is there a way to make sbt use existing compiled targets?
Update: To give more context
The question strictly relates to scala and sbt (hence the sbt tag).
Our CI process is broken down in to multiple phases. Its roughly something like this.
stage 1: Use SBT to compile Scala project into java bitecode using sbt compile We compile the test sources in the same test using sbt test:compile The targes are bundled in a docker image and pushed to the remote repository,
stage 2: We use multiple agents to split and run tests in parallel.
The tests run from the built docker image, so the environment is the
same. However, running sbt test causes the project to recompile even
through the compiled bitecode exists.
To make this clear, I basically want to compile on one machine and run the compiled test sources in another without re-compiling
Update
I don't think https://stackoverflow.com/a/37440714/8261 is the same problem because unlike it, I don't mount volumes or build on the host machine. Everything is compiled and run within docker but in two build stages. The file modified times and paths are retained the same because of this.
The debug output has something like this
Initial source changes:
removed:Set()
added: Set()
modified: Set()
Invalidated products: Set(/app/target/scala-2.12/classes/Class1.class, /app/target/scala-2.12/classes/graph/Class2.class, ...)
External API changes: API Changes: Set()
Modified binary dependencies: Set()
Initial directly invalidated classes: Set()
Sources indirectly invalidated by:
product: Set(/app/Class4.scala, /app/Class5.scala, ...)
binary dep: Set()
external source: Set()
All initially invalidated classes: Set()
All initially invalidated sources:Set(/app/Class4.scala, /app/Class5.scala, ...)
Recompiling all 304 sources: invalidated sources (266) exceeded 50.0% of all sources
Compiling 302 Scala sources and 2 Java sources to /app/target/scala-2.12/classes ...
It has no Initial source changes, but products are invalidated.
Update: Minimal project to reproduce
I created a minimal sbt project to reproduce the issue.
https://github.com/pulasthibandara/sbt-docker-recomplile
As you can see, nothing changes between the build stages, other than running in the second stage in a new step (new container).

While https://stackoverflow.com/a/37440714/8261 pointed at the right direction, the underlying issue and the solution for this was different.
Issue
SBT seems to recompile everything when it's run on different stages of a docker build. This is because docker compresses images created in each stage, which strips out the millisecond portion of the lastModifiedDate from sources.
SBT depends on lastModifiedDate when determining if sources have changed, and since its different (the milliseconds part) the build triggers a full recompilation.
Solution
Java 8:
Setting -Dsbt.io.jdktimestamps=true when running SBT as recommended in https://github.com/sbt/sbt/issues/4168#issuecomment-417655678 to workaround this issue.
Newer:
Follow recomendation in https://github.com/sbt/sbt/issues/4168#issuecomment-417658294
I solved the issue by setting SBT_OPTS env variable in the docker file like
ENV SBT_OPTS="${SBT_OPTS} -Dsbt.io.jdktimestamps=true"
The test project has been updated with this workaround.

Using SBT:
I think there is already an answer to this here: https://stackoverflow.com/a/37440714/8261
It looks tricky to get exactly right. Good luck!
Avoiding SBT:
If the above approach is too difficult (i.e. getting sbt test to consider that your test classes do not need re-compiling), you could instead avoid using sbt but instead run your test suite using java directly.
If you can get sbt to log the java command that it is using to run your test suite (e.g. using debug logging), then you could run that command on your test runner agents directly, which would completely preclude sbt re-compiling things.
(You might need to write the java command into a script file, if the classpath is too long to pass as a command-line argument in your shell. I have previously had to do that for a large project.)
This would be a much hackier approach that the one above, but might be quicker to get working.

A possible solution might be defining your own sbt task without dependencies or try to change the test task. For example you could create a task to run a JUnit runner if that was your testing framework. To define a task see this on Implementing Tasks.
You could even go as far as compiling sending the code and running the remotes from the same task as it is any scala code you want. From the sbt reference manual
You could be defining your own task, or you could be planning to redefine an existing task. Either way looks the same; use := to associate some code with the task key

Related

How to run a Scio pipeline on Dataflow from SBT (local)

I am trying to run my first Scio pipeline on Dataflow .
The code in question can be found here. However I do not think that is too important.
My first experiment was to read some local CSV files and write another local CSV file, using the DirecRunner. That worked as expected.
Now, I am trying to read the files from GCS, write the output to BigQuery and run the pipeline using the DataflowRunner. I already made all the necessary changes (or that is what I believe). But I am unable to make it run.
I already gcloud auth application-default login and when I do
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table
I can see the Jb is submitted in Dataflow. However, after one hour the jobs fails with the following error message.
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
(Note, the job did nothing in all that time, and since this is an experiment the data is simple too small to take more than a couple of minutes).
Checking the StackDriver I can find the follow error:
java.lang.ClassNotFoundException: scala.collection.Seq
Related to some jackson thing:
java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.scala.DefaultScalaModule could not be instantiated
And that is what is killing each executor just at the start. I really do not understand why I can not find the Scala standard library.
I also tried to first create a template and runt it latter with:
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table --stagingLocation=gs://path/to/staging --templateLocation=gs://path/to/templates/template-1
But, after running the template, I get the same error.
Also, I noticed that in the staging folder there are a lot of jars, but the scala-library.jar is not in there.
I am missing something obvious?
It's a known issue with sbt 1.3.0 which introduced some breaking change w.r.t. class loaders. Try 1.2.8?
Also the Jackson issue is probably related to Java 11 or above. Stay with Java 8 for now.
Fix by setting the sbt classLoaderLayeringStrategy:
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
sbt uses a new classloader for the application that is run with run. This causes other classes already loaded by the JVM (Predef for instance) to be reused, reducing startup time. See in-process classloaders for details.
This doesn't play well with the Beam DataflowRunner because it explicitly does not stage classes from parent classloaders, see PipelineResources.java#L51:
Attempts to detect all the resources the class loader has access to. This does not recurse to class loader parents stopping it from pulling in resources from the system class loader.
So the fix is to force all classes used by your application to be loaded in the same classloader so that DataflowRunner stages everything.
Hope that helps

how to disclude development.conf from docker image creation of play framework application artifact

Using scala playframework 2.5,
I build the app into a jar using sbt plugin PlayScala,
And then build and pushes a docker image out of it using sbt plugin DockerPlugin
Residing in the source code repository conf/development.conf (same where application.conf is).
The last line in application.conf says include development which means that in case development.conf exists, the entries inside of it will override some of the entries in application.conf in such way that provides all default values necessary for making the application runnable locally right out of the box after the source was cloned from source control with zero extra configuration. This technique allows every new developer to slip right in a working application without wasting time on configuration.
The only missing piece to make that architectural design complete is finding a way to exclude development.conf from the final runtime of the app - otherwise this overrides leak into production runtime and obviously the application fails to run.
That can be achieved in various different ways.
One way could be to some how inject logic into the build task (provided as part of the sbt pluging PlayScala I assume) to exclude the file from the jar artifact.
Other way could be injecting logic into the docker image creation process. this logic could manually delete development.conf from the existing jar prior to executing it (assuming that's possible)
If you ever implemented one of the ideas offered,
or maybe some different architectural approach that gives the same "works out of the box" feature, please be kind enough to share :)
I usually have the inverse logic:
I use the application.conf file (that Play uses by default) with all the things needed to run locally. I then have a production.conf file that starts by including the application.conf, and then overrides the necessary stuff.
for deploying to production (or staging) I specify the production/staging.conf file to be used
This is how I solved it eventually.
conf/application.conf is production ready configuration, it contains placeholders for environment variables whom values will be injected in runtime by k8s given the service's deployment.yaml file.
right next to it, conf/development.conf - its first line is include application.conf and the rest of it are overrides which will make the application run out of the box right after git clone by a simple sbt run
What makes the above work, is the addition of the following to build.sbt :
PlayKeys.devSettings := Seq(
"config.resource" -> "development.conf"
)
Works like a charm :)
This can be done via the mappings config key of sbt-native-packager:
mappings in Universal ~= (_.filterNot(_._1.name == "development.conf"))
See here.

How to trigger tests related to a particular file on Scala SBT

I'm a JS developer and I'm learning Scala, I'm used to the Jest test framework, in JS, which allows you to watch changed files and run tests for the changed modules.
Is there something similar in Scala?
I know that I can do:
~test to watch all the files and trigger all the tests
~test package/test_class to watch all the files and trigger only the specified test class.
Thanks a lot
I think you're looking for testQuick:
The testQuick task, like testOnly, allows to filter the tests to run to specific tests or wildcards using the same syntax to indicate the filters. In addition to the explicit filter, only the tests that satisfy one of the following conditions are run:
The tests that failed in the previous run
The tests that were not run before
The tests that have one or more transitive dependencies, maybe in a different project, recompiled.
You can use it with ~ and filtering that you already know.

Is it necessary to "clean" SBT before "stage" for production builds?

We are using Jenkins and SBT pluging to build, test and deploy our application. I see everywhere that people use "clean" command before "test" or "stage" which leads to very long compile times.
Is it necessary to clean everything for each build? What is the risk of using the incremental compiler for production builds?
I think it's a matter of taste. For continuous integration builds, it's ok to rely on the incremental compiler; if something fails I can investigate quite quickly.
But if, for whatever reason, the incremental compiler decides to ignore some changes (and it already happened to me in development), you may get some difficult bugs to resolve.
So, if you want something that's reproducible from a fresh code checkout; just run "clean" before everything else.
Most of the time, I configure jenkins with two build tasks: test builds, triggered by code changes, and packaging builds targeted for deployment and triggered periodically (or manually in some cases, with automatic deployment).
If the build script and/or command line to sbt have not changed then it should not be necessary to do clean.
I have been using Spark for several years - and it is one of the most complex sbt builds you can find. It works just fine to avoid doing clean as long as the build process itself (as embodied in the project/sbt build artifacts and the sbt command line) are the same as previous runs.

Scala - sbt: Is it safe to compile while running?

I often have to run some time-consuming experiments in scala, and usually I run a second sbt
instance for the same project where I make changes to the code that is running in the other instance and compile.
The reason I do this is so that I don't have to wait for a long running process to finish before I make progress with my code.
My question is: Is it safe to do that, or is there a possibility that recompiling parts of currently running code in sbt/scala will cause problems in my running process?
What I have observed so far is that most of the time it is fine, but I did run into a class not defined error once when refactoring my code while running.
As #marcus mentioned, the compiler writing a .class file that has not yet been loaded by your running JVM stands the chance of being loaded and not matching the other compiled classes. In many instances you'll be fine, but it could cause problems. There are a few things you can do in this situation:
Compile in separate directories. Check your code out into two completely different directories and do local commits (assuming you're using git) to push/pull from one copy of the repository to another. This will ensure that your testing doesn't get the compilation changes until you're ready (when you "pull" from the development repository).
Use an automated CI system like Jenkins or Travis to run your tests on each commit. This will, similarly to #1, not conflict with your development work since it is a separate checkout of the code.
Use sbt-revolver which runs the program in a separate JVM with the re-start command and will restart it whenever there are changes. This would interrupt your testing, however.
Use JRebel which does a better job of reloading classes than the JVM or most IDEs.