Python module dependencies on cloud-dataproc - google-cloud-dataproc

I am trying to deploy my code on Cloud Dataproc.
My app is made of two modules, moduleA.py and moduleB.py.
moduleA imports a function from moduleB.
I have uploaded both modules to the same bucket, but when I kick off my Dataproc template, Dataproc complains that it cannot find moduleB.
What extra steps do I need to follow in order for moduleA to see moduleB on Dataproc?
Kind regards

Apologies to all.. I think I had some other unrelated errors in one of the steps that I thought I had deleted; it had nothing to do with dependencies.
I managed to have a successful run by packaging the dependencies in a zip and running with --py-files gs://mydeps.zip .....
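For anyone hitting the same thing, the submit command ends up looking roughly like this (the bucket, cluster and region names below are placeholders):
gcloud dataproc jobs submit pyspark gs://my-bucket/moduleA.py \
    --cluster=my-cluster \
    --region=us-central1 \
    --py-files=gs://my-bucket/mydeps.zip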
Kind regards

Related

Have Rush add dev dependencies to new project

When adding a new project to a Rush monorepo, is there a way for Rush to automatically insert the dev dependencies into the package.json? For example I want to use the same test frameworks between projects so it would be good to have Rush sync the dev dependencies.
No, there is no way to do this. rush has no idea which package requires which dependencies and, as such, you'll need to add them manually to each.
However, once you've configured your package.json files accordingly, rush will help you maintain dependency versioning across your monorepo. The precise behaviour can be configured by:
setting preferredVersions in the common-versions.json file
using a version policy such as lockStepVersion
(I presume you found this answer already, but in case anyone stumbles across this in the future)
If you run rush add -h you get the usage.
usage: rush add [-h] -p PACKAGE [--exact] [--caret] [--dev] [-m] [-s] [--all]
--dev    If specified, the package will be added to the "devDependencies" section of the package.json
The command you are looking for is
rush add -p PACKAGENAME --dev

How to run a Scio pipeline on Dataflow from SBT (local)

I am trying to run my first Scio pipeline on Dataflow.
The code in question can be found here. However I do not think that is too important.
My first experiment was to read some local CSV files and write another local CSV file, using the DirectRunner. That worked as expected.
Now, I am trying to read the files from GCS, write the output to BigQuery and run the pipeline using the DataflowRunner. I already made all the necessary changes (or that is what I believe). But I am unable to make it run.
I already ran gcloud auth application-default login, and when I do
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table
I can see the job is submitted in Dataflow. However, after one hour the job fails with the following error message.
Workflow failed. Causes: The Dataflow job appears to be stuck because no worker activity has been seen in the last 1h.
(Note, the job did nothing in all that time, and since this is an experiment the data is simply too small to take more than a couple of minutes.)
Checking Stackdriver, I can find the following error:
java.lang.ClassNotFoundException: scala.collection.Seq
Related to some Jackson thing:
java.util.ServiceConfigurationError: com.fasterxml.jackson.databind.Module: Provider com.fasterxml.jackson.module.scala.DefaultScalaModule could not be instantiated
And that is what is killing each executor right at the start. I really do not understand why it cannot find the Scala standard library.
I also tried to first create a template and run it later with:
sbt run --runner=DataflowRunner --project=project-id --input-path=gs://path/to/data --output-table=dataset.table --stagingLocation=gs://path/to/staging --templateLocation=gs://path/to/templates/template-1
But, after running the template, I get the same error.
Also, I noticed that in the staging folder there are a lot of jars, but the scala-library.jar is not in there.
Am I missing something obvious?
It's a known issue with sbt 1.3.0 which introduced some breaking change w.r.t. class loaders. Try 1.2.8?
Also the Jackson issue is probably related to Java 11 or above. Stay with Java 8 for now.
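If you do pin it, the sbt version lives in project/build.properties (1.2.8 shown here only as the suggested downgrade):
sbt.version=1.2.8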
Fix by setting the sbt classLoaderLayeringStrategy:
run / classLoaderLayeringStrategy := ClassLoaderLayeringStrategy.Flat
sbt uses a new classloader for the application that is run with run. This causes other classes already loaded by the JVM (Predef for instance) to be reused, reducing startup time. See in-process classloaders for details.
This doesn't play well with the Beam DataflowRunner because it explicitly does not stage classes from parent classloaders, see PipelineResources.java#L51:
Attempts to detect all the resources the class loader has access to. This does not recurse to class loader parents stopping it from pulling in resources from the system class loader.
So the fix is to force all classes used by your application to be loaded in the same classloader so that DataflowRunner stages everything.
Hope that helps

How to re-use compiled sources in different machines

To speed up our development workflow we split the tests and run each part on multiple agents in parallel. However, compiling the test sources seems to take most of the time for the testing steps.
To avoid this, we pre-compile the tests using sbt test:compile and build a docker image with compiled targets.
Later, this image is used in each agent to run the tests. However, it seems to recompile the tests and application sources even though the compiled classes exist.
Is there a way to make sbt use existing compiled targets?
Update: To give more context
The question strictly relates to scala and sbt (hence the sbt tag).
Our CI process is broken down into multiple phases. It's roughly something like this.
stage 1: Use sbt to compile the Scala project into Java bytecode using sbt compile. We compile the test sources in the same stage using sbt test:compile. The targets are bundled in a docker image and pushed to the remote repository.
stage 2: We use multiple agents to split and run tests in parallel.
The tests run from the built docker image, so the environment is the same. However, running sbt test causes the project to recompile even though the compiled bytecode exists.
To make this clear, I basically want to compile on one machine and run the compiled test sources on another without re-compiling.
Update
I don't think https://stackoverflow.com/a/37440714/8261 is the same problem because unlike it, I don't mount volumes or build on the host machine. Everything is compiled and run within docker but in two build stages. The file modified times and paths are retained the same because of this.
The debug output has something like this
Initial source changes:
removed:Set()
added: Set()
modified: Set()
Invalidated products: Set(/app/target/scala-2.12/classes/Class1.class, /app/target/scala-2.12/classes/graph/Class2.class, ...)
External API changes: API Changes: Set()
Modified binary dependencies: Set()
Initial directly invalidated classes: Set()
Sources indirectly invalidated by:
product: Set(/app/Class4.scala, /app/Class5.scala, ...)
binary dep: Set()
external source: Set()
All initially invalidated classes: Set()
All initially invalidated sources:Set(/app/Class4.scala, /app/Class5.scala, ...)
Recompiling all 304 sources: invalidated sources (266) exceeded 50.0% of all sources
Compiling 302 Scala sources and 2 Java sources to /app/target/scala-2.12/classes ...
It has no Initial source changes, but products are invalidated.
Update: Minimal project to reproduce
I created a minimal sbt project to reproduce the issue.
https://github.com/pulasthibandara/sbt-docker-recomplile
As you can see, nothing changes between the build stages, other than running in the second stage in a new step (new container).
While https://stackoverflow.com/a/37440714/8261 pointed at the right direction, the underlying issue and the solution for this was different.
Issue
SBT seems to recompile everything when it's run on different stages of a docker build. This is because docker compresses images created in each stage, which strips out the millisecond portion of the lastModifiedDate from sources.
SBT depends on lastModifiedDate when determining if sources have changed, and since it's different (the milliseconds part) the build triggers a full recompilation.
Solution
Java 8:
Set -Dsbt.io.jdktimestamps=true when running SBT, as recommended in https://github.com/sbt/sbt/issues/4168#issuecomment-417655678, to work around this issue.
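For a one-off invocation this can be passed directly to the sbt launcher, e.g.:
sbt -Dsbt.io.jdktimestamps=true test:compile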
Newer:
Follow the recommendation in https://github.com/sbt/sbt/issues/4168#issuecomment-417658294
I solved the issue by setting the SBT_OPTS env variable in the Dockerfile like:
ENV SBT_OPTS="${SBT_OPTS} -Dsbt.io.jdktimestamps=true"
The test project has been updated with this workaround.
Using SBT:
I think there is already an answer to this here: https://stackoverflow.com/a/37440714/8261
It looks tricky to get exactly right. Good luck!
Avoiding SBT:
If the above approach is too difficult (i.e. getting sbt test to consider that your test classes do not need re-compiling), you could avoid using sbt and instead run your test suite using java directly.
If you can get sbt to log the java command that it is using to run your test suite (e.g. using debug logging), then you could run that command on your test runner agents directly, which would completely preclude sbt re-compiling things.
(You might need to write the java command into a script file, if the classpath is too long to pass as a command-line argument in your shell. I have previously had to do that for a large project.)
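A rough sketch of that approach, assuming ScalaTest as the test framework and sbt's export command to capture the test classpath (you may need to filter sbt's own log lines from the output, the runner class and flags depend on your framework, and the paths assume Scala 2.12 defaults):
sbt "export Test/fullClasspath" | tail -n 1 > test-classpath.txt
java -cp "$(cat test-classpath.txt)" org.scalatest.tools.Runner -R target/scala-2.12/test-classes -o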
This would be a much hackier approach than the one above, but might be quicker to get working.
A possible solution might be defining your own sbt task without dependencies or try to change the test task. For example you could create a task to run a JUnit runner if that was your testing framework. To define a task see this on Implementing Tasks.
You could even go as far as compiling, sending the code, and running it on the remote agents from the same task, since it is just Scala code and you can do whatever you want. From the sbt reference manual:
You could be defining your own task, or you could be planning to redefine an existing task. Either way looks the same; use := to associate some code with the task key
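For example, a minimal custom task in build.sbt could look like this (the task name and body are purely illustrative):
// Hypothetical task key; := binds the code that runs when the task is invoked.
val runPrecompiledTests = taskKey[Unit]("Runs tests against already-compiled classes")
runPrecompiledTests := {
  // Any Scala code can go here, e.g. invoking a JUnit or ScalaTest runner
  // against the classes already present under target/.
  println("running precompiled tests...")
}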

Scala import not working - object <name> is not a member of package, sbt prepends current package namespace in imports

I have an issue when trying to import in scala. The object Database exists under com.me.project.database but when I try to import it:
import com.me.project.database.Database
I get the error:
object Database is not a member of package com.me.project.controllers.com.me.project.database
Any ideas what the problem is?
Edit:
It is worth mentioning that the import is in the file Application.scala under the package com.me.project.controllers. I can't figure out why it would prepend the current package to the import though, weird...
Edit 2:
So using:
import _root_.com.me.project.database.Database
Does work as mentioned below. But should it work without the _root_? The comments so far seem to indicate that it should.
Answer:
So it turns out that I just needed to clean the project for the import to work properly, using both:
import _root_.com.me.project.database.Database
import com.me.project.database.Database
are valid solutions. Eclipse had just gotten confused.
Imports can be relative. Is that the only import you have? Be careful with other imports like
import com.me
Ultimately, this should fix it, then you can try to find out more about it:
import _root_.com.me.project.database.Database
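To see why _root_ helps, here is a hypothetical layout (package names invented) showing how the relative lookup goes wrong:
package com.me.project.controllers

// If some other file in the build declares a nested package such as
//   package com.me.project.controllers.com.me.project.database
// then inside this package the simple name `com` resolves to that nested
// package, so a plain `import com.me.project.database.Database` is looked up
// relative to it and fails. Anchoring the import at the root avoids that:
import _root_.com.me.project.database.Database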
In my case I also needed to check that the object which was not found as a member of the package had compiled successfully.
I realize this question already has an accepted answer, but since I experienced the same problem but with a different cause I figured I'd add an answer.
I had a bunch of interdependent projects which suddenly needed a root import in order to compile. It turned out that I had duplicated the package declaration in a single file. This caused some kind of chain reaction and made it very hard to find the source of the problem.
In summary I had
package foo.bar
package foo.bar
on the top of the file instead of just
package foo.bar
Hope this saves someone some really tedious error hunting.
In my case I had to run sbt clean.
I faced a similar issue where IntelliJ showed an error on importing one file from the same project.
What did not resolve the issue in my case:
adding _root_ in import statement
sbt clean
restarting machine
What actually resolved the issue:
main menu => select File => click on Invalidate Caches / Restart => in the pop-up dialog => click on Invalidate and Restart.
I was using IDEA (2019.2.2 Ultimate Edition) on macOS Mojave 10.14.6.
Java -> Scala conversion without cleaning
Don't forget to clean if you convert some file in a project from Java to Scala. I had a continuous integration build running where I couldn't get things to work, even though the build was working locally, after I had converted a Java class into a Scala object. Solution: add 'clean' to the build procedure on the CI server. The name of the generated .class file in Scala is slightly different than for a Java class, I believe, so this is very likely what was causing the issue.
If you are using gradle as your build tool, then ensure that jar task is not disabled.
I had multiple modules in my project, where one module was dependent on a few other modules. However, I had disabled the jar task in build.gradle:
jar {
    enabled = false
}
That caused classes in the dependent modules to fail to resolve, and the build failed with the above error.
I will share my story, just in case it may help someone.
Scenario: IntelliJ compilation succeeds, but the Gradle build fails on import com.foo.Bar, where Bar is a Scala class.
TLDR reason: Bar was located under src/main/java/... as opposed to src/main/scala/...
Actual reason: Bar was not being compiled by the compileScala Gradle task (from the Gradle Scala plugin) because it looks for Scala sources only under src/<sourceSet>/scala.
From docs.gradle.org:
All the Scala source directories can contain Scala and Java code. The Java source directories may only contain Java source code.
Hope this helps
I had a similar problem but none of the solutions here worked for me. What did work however was a simple restart of my machine.
Perhaps it was something with my IntelliJ, but after a quick restart everything seems to be working fine.
I had a similar situation, which was failing in both IntelliJ and Maven on the command line. I went to apply the suggested temp fix (adding _root_) but IntelliJ was glitching so badly that wasn't even possible.
Eventually I noticed that I had mis-created a package so that it repeated the whole path of the package. That meant that the directory my class was in had a subfolder called "com", and the start of my file looked like:
package com.mycompany.mydept.myproject.myfunctionality.sub1
import com.holdenkarau.spark.testing.DataFrameSuiteBase
where I had another package called
com.mycompany.mydept.myproject.myfunctionality.sub1.com.mycompany.mydept.myproject.myfunctionality.sub2
And the compiler was looking for "holdenkarau" under com.mycompany.mydept.myproject.myfunctionality.com and failing.
I had this issue while using IntelliJ and the built-in sbt shell (precisely, I was trying to run the command console, which invokes a compiler check of the code).
In my case, after trying the other suggested solutions on this thread, I found that I could restart the sbt shell and it would go away. There's a button on the left-hand side, a looped green arrow with a small grey square, which does this in one click (obviously, this is subject to JetBrains not changing the design of the IDE!!!).
I hope this helps some people get past this issue quickly.
In my case, in IntelliJ, just renaming the package to something else >> seeing if it updates the import statements >> running the code >> then renaming it back to the original name worked.

Converting CruiseControl to Hudson

I am currently writing a Perl script that will convert CruiseControl config.xml files to a Hudson config.xml for each project. However I am stuck at one key part: how do I make it so the sub-modules of a project also get the goals from their CC config?
I can do the root module fine, and set up the configurations fine as well. I just need a way to configure Hudson to add the sub modules, copy the goals from a file, import the goals, then run the build for the module. The way I am thinking right now is that I could either:
Make a Perl script that runs before the build or
Make a groovy script that integrates with Hudson and have it manually do these steps.
Side Note: If anyone is interested about using this script I would be willing to publish it once it is done.
So I believe I figured out my own problem. Essentially what I am going to do is have every module set to clean, add a text file listing which module has which goals, then add the goals section to the config.xml. Then reload Hudson from disk so it can pick up the goals, and re-run the job with the proper goals for the sub-job. I did this via Perl.