How to run a Scio pipeline on Dataflow from SBT (local) - scala

I am trying to run my first Scio pipeline on Dataflow .
The code in question can be found here. However I do not think that is too important.
My first experiment was to read some local CSV files and write another local CSV file, using the DirecRunner. That worked as expected.
Now, I am trying to read the files from GCS, write the output to BigQuery and run the pipeline using the DataflowRunner. I already made all the necessary changes (or that is what I believe). But I am unable to make it run.
I already gcloud auth application-default login and when I do
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table
I can see the Jb is submitted in Dataflow. However, after one hour the jobs fails with the following error message.
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
(Note, the job did nothing in all that time, and since this is an experiment the data is simple too small to take more than a couple of minutes).
Checking the StackDriver I can find the follow error:
java.lang.ClassNotFoundException: scala.collection.Seq
Related to some jackson thing:
java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.scala.DefaultScalaModule could not be instantiated
And that is what is killing each executor just at the start. I really do not understand why I can not find the Scala standard library.
I also tried to first create a template and runt it latter with:
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table --stagingLocation=gs://path/to/staging --templateLocation=gs://path/to/templates/template-1
But, after running the template, I get the same error.
Also, I noticed that in the staging folder there are a lot of jars, but the scala-library.jar is not in there.
I am missing something obvious?

It's a known issue with sbt 1.3.0 which introduced some breaking change w.r.t. class loaders. Try 1.2.8?
Also the Jackson issue is probably related to Java 11 or above. Stay with Java 8 for now.

Fix by setting the sbt classLoaderLayeringStrategy:
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
sbt uses a new classloader for the application that is run with run. This causes other classes already loaded by the JVM (Predef for instance) to be reused, reducing startup time. See in-process classloaders for details.
This doesn't play well with the Beam DataflowRunner because it explicitly does not stage classes from parent classloaders, see PipelineResources.java#L51:
Attempts to detect all the resources the class loader has access to. This does not recurse to class loader parents stopping it from pulling in resources from the system class loader.
So the fix is to force all classes used by your application to be loaded in the same classloader so that DataflowRunner stages everything.
Hope that helps

Related

How to re-use compiled sources in different machines

To speed up our development workflow we split the tests and run each part on multiple agents in parallel. However, compiling test sources seem to take most of the time for the testing steps.
To avoid this, we pre-compile the tests using sbt test:compile and build a docker image with compiled targets.
Later, this image is used in each agent to run the tests. However, it seems to recompile the tests and application sources even though the compiled classes exists.
Is there a way to make sbt use existing compiled targets?
Update: To give more context
The question strictly relates to scala and sbt (hence the sbt tag).
Our CI process is broken down in to multiple phases. Its roughly something like this.
stage 1: Use SBT to compile Scala project into java bitecode using sbt compile We compile the test sources in the same test using sbt test:compile The targes are bundled in a docker image and pushed to the remote repository,
stage 2: We use multiple agents to split and run tests in parallel.
The tests run from the built docker image, so the environment is the
same. However, running sbt test causes the project to recompile even
through the compiled bitecode exists.
To make this clear, I basically want to compile on one machine and run the compiled test sources in another without re-compiling
Update
I don't think https://stackoverflow.com/a/37440714/8261 is the same problem because unlike it, I don't mount volumes or build on the host machine. Everything is compiled and run within docker but in two build stages. The file modified times and paths are retained the same because of this.
The debug output has something like this
Initial source changes:
removed:Set()
added: Set()
modified: Set()
Invalidated products: Set(/app/target/scala-2.12/classes/Class1.class, /app/target/scala-2.12/classes/graph/Class2.class, ...)
External API changes: API Changes: Set()
Modified binary dependencies: Set()
Initial directly invalidated classes: Set()
Sources indirectly invalidated by:
product: Set(/app/Class4.scala, /app/Class5.scala, ...)
binary dep: Set()
external source: Set()
All initially invalidated classes: Set()
All initially invalidated sources:Set(/app/Class4.scala, /app/Class5.scala, ...)
Recompiling all 304 sources: invalidated sources (266) exceeded 50.0% of all sources
Compiling 302 Scala sources and 2 Java sources to /app/target/scala-2.12/classes ...
It has no Initial source changes, but products are invalidated.
Update: Minimal project to reproduce
I created a minimal sbt project to reproduce the issue.
https://github.com/pulasthibandara/sbt-docker-recomplile
As you can see, nothing changes between the build stages, other than running in the second stage in a new step (new container).
While https://stackoverflow.com/a/37440714/8261 pointed at the right direction, the underlying issue and the solution for this was different.
Issue
SBT seems to recompile everything when it's run on different stages of a docker build. This is because docker compresses images created in each stage, which strips out the millisecond portion of the lastModifiedDate from sources.
SBT depends on lastModifiedDate when determining if sources have changed, and since its different (the milliseconds part) the build triggers a full recompilation.
Solution
Java 8:
Setting -Dsbt.io.jdktimestamps=true when running SBT as recommended in https://github.com/sbt/sbt/issues/4168#issuecomment-417655678 to workaround this issue.
Newer:
Follow recomendation in https://github.com/sbt/sbt/issues/4168#issuecomment-417658294
I solved the issue by setting SBT_OPTS env variable in the docker file like
ENV SBT_OPTS="${SBT_OPTS} -Dsbt.io.jdktimestamps=true"
The test project has been updated with this workaround.
Using SBT:
I think there is already an answer to this here: https://stackoverflow.com/a/37440714/8261
It looks tricky to get exactly right. Good luck!
Avoiding SBT:
If the above approach is too difficult (i.e. getting sbt test to consider that your test classes do not need re-compiling), you could instead avoid using sbt but instead run your test suite using java directly.
If you can get sbt to log the java command that it is using to run your test suite (e.g. using debug logging), then you could run that command on your test runner agents directly, which would completely preclude sbt re-compiling things.
(You might need to write the java command into a script file, if the classpath is too long to pass as a command-line argument in your shell. I have previously had to do that for a large project.)
This would be a much hackier approach that the one above, but might be quicker to get working.
A possible solution might be defining your own sbt task without dependencies or try to change the test task. For example you could create a task to run a JUnit runner if that was your testing framework. To define a task see this on Implementing Tasks.
You could even go as far as compiling sending the code and running the remotes from the same task as it is any scala code you want. From the sbt reference manual
You could be defining your own task, or you could be planning to redefine an existing task. Either way looks the same; use := to associate some code with the task key

how to disclude development.conf from docker image creation of play framework application artifact

Using scala playframework 2.5,
I build the app into a jar using sbt plugin PlayScala,
And then build and pushes a docker image out of it using sbt plugin DockerPlugin
Residing in the source code repository conf/development.conf (same where application.conf is).
The last line in application.conf says include development which means that in case development.conf exists, the entries inside of it will override some of the entries in application.conf in such way that provides all default values necessary for making the application runnable locally right out of the box after the source was cloned from source control with zero extra configuration. This technique allows every new developer to slip right in a working application without wasting time on configuration.
The only missing piece to make that architectural design complete is finding a way to exclude development.conf from the final runtime of the app - otherwise this overrides leak into production runtime and obviously the application fails to run.
That can be achieved in various different ways.
One way could be to some how inject logic into the build task (provided as part of the sbt pluging PlayScala I assume) to exclude the file from the jar artifact.
Other way could be injecting logic into the docker image creation process. this logic could manually delete development.conf from the existing jar prior to executing it (assuming that's possible)
If you ever implemented one of the ideas offered,
or maybe some different architectural approach that gives the same "works out of the box" feature, please be kind enough to share :)
I usually have the inverse logic:
I use the application.conf file (that Play uses by default) with all the things needed to run locally. I then have a production.conf file that starts by including the application.conf, and then overrides the necessary stuff.
for deploying to production (or staging) I specify the production/staging.conf file to be used
This is how I solved it eventually.
conf/application.conf is production ready configuration, it contains placeholders for environment variables whom values will be injected in runtime by k8s given the service's deployment.yaml file.
right next to it, conf/development.conf - its first line is include application.conf and the rest of it are overrides which will make the application run out of the box right after git clone by a simple sbt run
What makes the above work, is the addition of the following to build.sbt :
PlayKeys.devSettings := Seq(
"config.resource" -> "development.conf"
)
Works like a charm :)
This can be done via the mappings config key of sbt-native-packager:
mappings in Universal ~= (_.filterNot(_._1.name == "development.conf"))
See here.

JAI can't execute in native spark - only in sbt and as a separate scala function

I want to use a library (JAI) with spark to parse some spatial raster files. Unfortunately, there are some strange issues. JAI only works when running via the build tool i.e. sbt run when executed in spark.
When executed via spark-submit the error is:
java.lang.IllegalArgumentException: The input argument(s) may not be null.
at javax.media.jai.ParameterBlockJAI.getDefaultMode(ParameterBlockJAI.java:136)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:157)
at javax.media.jai.ParameterBlockJAI.<init>(ParameterBlockJAI.java:178)
at org.geotools.process.raster.PolygonExtractionProcess.execute(PolygonExtractionProcess.java:171)
Which looks like some native dependency is not called correctly.
Assuming something is wrong with the class path I tried to run a plain java/scala function. but this one works just fine.
In fact, the exact same problem occurs when Nifi is calling the parse function.
Is spark messing with the class paths? What is different from running the jar natively via java-jar or through spark or NiFi? Both show the same problem even when concurrency is disabled and they run only on a single thread.
JAI vendorname == null is somewhat similar as it shows what can go wrong when running a jar with JAI. I could not identify this as the exact same problem though.
I created a minimal example here:
https://github.com/geoHeil/jai-packaging-problem
Due to the dependency on the build process & packaging of native libraries I think it will not be possible to include snippets directly in this posting.
edit
I am pretty convinced this has to do the the assembly merge strategy, so far I could not find one which works.
Below you can see that the Vectorize operation is missing on sparks class path
edit 2
I think spark / NiFis class loader will not load some of the required registry files for JAI. A plain java app works fine with these assembly/ fat-jar settings.

Scala - sbt: Is it safe to compile while running?

I often have to run some time-consuming experiments in scala, and usually I run a second sbt
instance for the same project where I make changes to the code that is running in the other instance and compile.
The reason I do this is so that I don't have to wait for a long running process to finish before I make progress with my code.
My question is: Is it safe to do that, or is there a possibility that recompiling parts of currently running code in sbt/scala will cause problems in my running process?
What I have observed so far is that most of the time it is fine, but I did run into a class not defined error once when refactoring my code while running.
As #marcus mentioned, the compiler writing a .class file that has not yet been loaded by your running JVM stands the chance of being loaded and not matching the other compiled classes. In many instances you'll be fine, but it could cause problems. There are a few things you can do in this situation:
Compile in separate directories. Check your code out into two completely different directories and do local commits (assuming you're using git) to push/pull from one copy of the repository to another. This will ensure that your testing doesn't get the compilation changes until you're ready (when you "pull" from the development repository).
Use an automated CI system like Jenkins or Travis to run your tests on each commit. This will, similarly to #1, not conflict with your development work since it is a separate checkout of the code.
Use sbt-revolver which runs the program in a separate JVM with the re-start command and will restart it whenever there are changes. This would interrupt your testing, however.
Use JRebel which does a better job of reloading classes than the JVM or most IDEs.

Running a mapreduce jar on Hadoop cluster

I'm trying to run the map reduce implementation of quadratic sieve algorithm on Hadoop. For this purpose I'm using karmasphere Hadoop community plugin with Netbeans. The program works fine using the plugin. But I'm unable to run it on actual cluster.
I'm running this command
bin/hadoop jar MRIF.jar 689
Where MRIF.jar is the jar file made by building the netbeans project and 689 is number to be factored. The input and output directories are hard coded in program itself. When running on actual cluster, it appears that the inside java classes are not being processed as reduce completes to 100% before map being at 0% itself. And input and output files are created with no content.
But this works fine when running using Karmasphere plugin.
Try running it as bin/hadoop -jar MRIF.jar 689. The -jar forces it to run local and displays information to the console as well as logs to that machine. You can also check the Hadoop logs to see if they have any indicators of why it's not happening correctly.
When using -jar you can use System.out.println(...); to display information on the console, further helping to debug.
You can also use Hadoop Counters (link is random blog post I found) to assist in troubleshooting when running (psuedo-)distributed.
I admit this post isn't a 'solution' to the problem; Without more/further information about what is happening and where, there is a wide range of things that could be going on. If it is, as you mention, not processing the 'inside java classes' then it would likely be your implementation, of which we can't see to make suggestions, ect.
More data about the issue, such as logs, errors or output will likely assist in getting more solution-y responses instead of debugging tips. :)
EDIT: Thanks for the link to the files. I think your call is missing a component.
I looked in the run.sh and think this might get it to work for you:
bin/hadoop jar mrif.jar com.javiertordable.mrif.MapReduceQuadraticSieve 689