Modifying and Building Spark core - Scala

I am trying to modify the Apache Spark source code. I created a new method and added it to the RDD.scala file in the Spark source I downloaded. After making the modification to RDD.scala, I built Spark using
mvn -Dhadoop.version=2.2.0 -DskipTests clean package
I then created a sample Scala Spark application as mentioned here.
When I tried to use the new function, I got a compilation error while using sbt to build a jar of my application. How exactly do I compile Spark with my modification and attach the modified jar to my project? The file I modified is RDD.scala within the core project, and I run sbt package from the root directory of my Spark application project.
Here is the sbt file:
name := "N Spark"
version := "1.0"
scalaVersion := "2.11.6"
libraryDependencies += "org.apache.spark" % "spark-core_2.11" % "1.3.0"
Here is the error:
sbt package
[info] Loading global plugins from /Users/Raggy/.sbt/0.13/plugins
[info] Set current project to Noah Spark (in build file:/Users/r/Downloads/spark-proj/n-spark/)
[info] Updating {file:/Users/r/Downloads/spark-proj/n-spark/}n-spark...
[info] Resolving jline#jline;2.12.1 ...
[info] Done updating.
[info] Compiling 1 Scala source to /Users/r/Downloads/spark-proj/n-spark/target/scala-2.11/classes...
[error] /Users/r/Downloads/spark-proj/n-spark/src/main/scala/SimpleApp.scala:11: value reducePrime is not a member of org.apache.spark.rdd.RDD[Int]
[error] logData.reducePrime(_+_);
[error] ^
[error] one error found
[error] (compile:compileIncremental) Compilation failed
[error] Total time: 24 s, completed Apr 11, 2015 2:24:03 AM
UPDATE
Here is the updated sbt file:
name := "N Spark"
version := "1.0"
scalaVersion := "2.10"
libraryDependencies += "org.apache.spark" % "1.3.0"
I get the following error for this file:
[info] Loading global plugins from /Users/Raggy/.sbt/0.13/plugins
/Users/Raggy/Downloads/spark-proj/noah-spark/simple.sbt:7: error: No implicit for Append.Value[Seq[sbt.ModuleID], sbt.impl.GroupArtifactID] found,
so sbt.impl.GroupArtifactID cannot be appended to Seq[sbt.ModuleID]
libraryDependencies += "org.apache.spark" % "1.3.0"

Delete the libraryDependencies line from build.sbt and just copy the custom-built Spark jar into the lib directory of your application project: sbt picks up every jar in lib/ as an unmanaged dependency, so your modified classes end up on the compile classpath.
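For example, the application's build.sbt could shrink to just the project metadata, with the jar you built from the modified Spark source copied into lib/ (a sketch; the exact assembly jar name depends on what your Maven build produced):
name := "N Spark"
version := "1.0"
scalaVersion := "2.11.6"
// No spark-core dependency: sbt treats every jar in lib/ (the default
// unmanagedBase) as an unmanaged dependency, so the custom-built Spark
// assembly jar placed there supplies the modified RDD on the classpath.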

Related

Unable to import locally published Scala plugin

I have a project which I publish locally to my .m2 directory as a plugin, and which I later need to import into a different Scala project and use.
The publishing step appears to execute correctly.
The build.sbt file of the plugin project looks like this:
lazy val root = (project in file("."))
  .enablePlugins(SbtPlugin)
  .settings(
    name := "pluginName",
    organization := "com.myOrg",
    pluginCrossBuild / sbtVersion := {
      scalaBinaryVersion.value match {
        case "2.12" => "1.4.6" // set minimum sbt version
      }
    }
  )

resolvers += "confluent" at "https://packages.confluent.io/maven"

libraryDependencies ++= Seq(
  "io.confluent" % "kafka-schema-registry-client" % "7.0.1"
  // some other dependencies
)
After running the compile and publishLocal commands in sbt shell I get the next message:
[info] delivering ivy file to /Users/me/Work/repos/external/pluginName/target/scala-2.12/sbt-1.0/ivy-1.0.0.xml
[info] published pluginName to /Users/me/.ivy2/local/com.myOrg/pluginName/scala_2.12/sbt_1.0/1.0.0/poms/pluginName.pom
[info] published pluginName to /Users/me/.ivy2/local/com.myOrg/pluginName/scala_2.12/sbt_1.0/1.0.0/jars/pluginName.jar
[info] published pluginName to /Users/me/.ivy2/local/com.myOrg/pluginName/scala_2.12/sbt_1.0/1.0.0/srcs/pluginName-sources.jar
[info] published pluginName to /Users/me/.ivy2/local/com.myOrg/pluginName/scala_2.12/sbt_1.0/1.0.0/docs/pluginName-javadoc.jar
[info] published ivy to /Users/me/.ivy2/local/com.myOrg/pluginName/scala_2.12/sbt_1.0/1.0.0/ivys/ivy.xml
[success] Total time: 0 s, completed 3 Jan 2022, 10:07:43
In order to import/install this plugin in the other Scala project, I added the following line to the plugins.sbt file: addSbtPlugin("com.myOrg" % "pluginName" % "1.0.0")
I also added libs-release-local and libs-snapshot-local to the externalResolvers section in the build.sbt file.
After reloading and compiling the project I received this error:
[error] (update) sbt.librarymanagement.ResolveException: Error downloading io.confluent:kafka-schema-registry-client:7.0.1
[error] Not found
[error] Not found
[error] not found: https://repo1.maven.org/maven2/io/confluent/kafka-schema-registry-client/7.0.1/kafka-schema-registry-client-7.0.1.pom
[error] not found: /Users/me/.ivy2/local/io.confluent/kafka-schema-registry-client/7.0.1/ivys/ivy.xml
[error] not found: https://repo.scala-sbt.org/scalasbt/sbt-plugin-releases/io.confluent/kafka-schema-registry-client/7.0.1/ivys/ivy.xml
[error] not found: https://repo.typesafe.com/typesafe/ivy-releases/io.confluent/kafka-schema-registry-client/7.0.1/ivys/ivy.xml
I am kind of new to Scala and I don't understand what I am doing wrong.
Can anyone shed some light on this problem?
You're publishing to your local Maven cache, but sbt uses Ivy.
Try removing the publishTo setting; it shouldn't be needed. Just use the publishLocal task to publish to your local Ivy cache.
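A minimal sketch of the intended flow, using the names from the question (note that plugin dependencies are resolved by the consuming build's meta-project, so the Confluent resolver likely has to be declared there as well):
// In the plugin project, publish to ~/.ivy2/local:
//   sbt> publishLocal
// In the consuming project's project/plugins.sbt:
resolvers += "confluent" at "https://packages.confluent.io/maven"
addSbtPlugin("com.myOrg" % "pluginName" % "1.0.0")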

package name not observed by sbt

I compiled a small Scala example of a Spark program called AverageAgeByName.scala
Here is the build.sbt:
$ cd /opt/LearningSparkV2-master/chapter1/main/scala/chapter3
$ vim build.sbt
// Name of the package
name := "main/scala/chapter3"
// Version of our package
version := "1.0"
// Version of Scala
scalaVersion := "2.12.14"
// Spark library dependencies
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.1.2",
  "org.apache.spark" %% "spark-sql" % "3.1.2"
)
I ran the command:
$ sbt clean package
[info] Updated file /opt/LearningSparkV2-master/chapter1/main/scala/chapter3/project/build.properties: set sbt.version to 1.5.5
[info] welcome to sbt 1.5.5 (Oracle Corporation Java 1.8.0_242)
[info] loading project definition from /opt/LearningSparkV2-master/chapter1/main/scala/chapter3/project
[info] loading settings for project chapter3 from build.sbt ...
[info] set current project to main/scala/chapter3 (in build file:/opt/LearningSparkV2-master/chapter1/main/scala/chapter3/)
[success] Total time: 0 s, completed Aug 4, 2021 10:07:27 AM
[info] compiling 1 Scala source to /opt/LearningSparkV2-master/chapter1/main/scala/chapter3/target/scala-2.12/classes ...
[success] Total time: 9 s, completed Aug 4, 2021 10:07:36 AM
The resulting class files were not placed in the proper directory hierarchy according to the package name in build.sbt:
$ ls /opt/LearningSparkV2-master/chapter1/main/scala/chapter3/target/scala-2.12/classes
'AverageAgeByName$$typecreator1$1.class' 'AverageAgeByName$.class' AverageAgeByName.class
It's flat. I expected the classes to be placed in /opt/LearningSparkV2-master/chapter1/main/scala/chapter3/target/scala-2.12/classes/main/scala/chapter3
Where did I go wrong?
The name := "main/scala/chapter3" in build.sbt has nothing to do with a package or a destination folder: it is the name of the project, used, for instance, when packaging your project as a JAR.
The folder the classes are generated in is driven by the package you declare in your Scala file AverageAgeByName.scala.
The following would generate class files in target/scala-2.12/classes/xxx/yyy:
package xxx.yyy
class AverageAgeByName {}
Also note that source files are usually placed in a directory structure matching the declared package, i.e. src/main/scala/xxx/yyy in the example above.
And lastly, you should probably not care at all about where the class files are generated.
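Concretely, a conventional layout for this example might look like the following (a sketch; chapter3 is an assumed package name, not something prescribed by the book):
// build.sbt -- just the artifact/project name, no path semantics
name := "chapter3"
// src/main/scala/chapter3/AverageAgeByName.scala
package chapter3
class AverageAgeByName {}
// compiles to target/scala-2.12/classes/chapter3/AverageAgeByName.class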

Attempting to execute compile task but mystery module can't be loaded

I'm compiling a multi-part Scala project. It's not that large, but some of it is Scala 2.13 and some is Scala 3.
Attempting to compile generates the fatal error UNRESOLVED DEPENDENCIES: base#base_2.12;0.1.0-SNAPSHOT: not found.
The thing is, the string 0.1.0-SNAPSHOT doesn't occur anywhere in my build.sbt or anywhere else. It used to be there, but it's long gone. I assume some update cache contains it, but I've been unable to find it.
Here is my build.sbt:
ThisBuild / libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.7" % Test
ThisBuild / Compile / scalacOptions ++= Seq("--deprecation")
ThisBuild / Test / logBuffered := false
ThisBuild / Test / parallelExecution := false

lazy val scala213 = "2.13.5"
lazy val scala212 = "2.12.13"
lazy val scala3 = "3.0.0-RC2"
lazy val supportedScalaVersions = List(scala213, scala3)

lazy val root = (project in file("."))
  .aggregate(top, trans, base)
  .settings(
    name := "toysat"
  )

lazy val top = (project in file("top"))
  .settings(
    name := "main",
    scalaVersion := scala213,
    scalacOptions += "-Ytasty-reader",
    libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.7" % Test
  )
  .dependsOn(trans, base)

lazy val trans = (project in file("trans"))
  .settings(
    name := "trans",
    Compile / scalaVersion := scala3,
    libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.7" % Test
  )
  .dependsOn(base)

lazy val base = (project in file("base"))
  .settings(
    name := "base",
    scalaVersion := scala213,
    libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.7" % Test
  )
Most questions of this ilk on Stack Overflow are about downloading remotely defined modules; the problem I'm having is that sbt cannot find an obsolete version of one of my own (freshly compiled) modules.
Here is the sbt command output (this is an Emacs buffer):
sbt:toysat> reload
[info] welcome to sbt 1.5.5 (AdoptOpenJDK Java 1.8.0_292)
[info] loading project definition from /Users/drewmcdermott/BIG/RESEARCH/puzzles/toystory4/toysat/project
[info] loading settings for project root from build.sbt ...
[info] set current project to toysat (in build file:/Users/drewmcdermott/BIG/RESEARCH/puzzles/toystory4/toysat/)
sbt:toysat> compile
[info] compiling 4 Scala sources to /Users/drewmcdermott/BIG/RESEARCH/puzzles/toystory4/toysat/base/target/scala-2.13/classes ...
[warn]
[warn] Note: Unresolved dependencies path:
[info] done compiling
[error] stack trace is suppressed; run last trans / update for the full output
[error] (trans / update) sbt.librarymanagement.ResolveException: Error downloading base:base_2.12:0.1.0-SNAPSHOT
[error] Not found
[error] Not found
[error] not found: /Users/drewmcdermott/.ivy2/local/base/base_2.12/0.1.0-SNAPSHOT/ivys/ivy.xml
[error] not found: https://repo1.maven.org/maven2/base/base_2.12/0.1.0-SNAPSHOT/base_2.12-0.1.0-SNAPSHOT.pom
[error] Total time: 25 s, completed Jul 28, 2021 11:06:18 PM
The 25 seconds were consumed compiling the 4 files in the base subproject, apparently successfully. I think it's when sbt tries to compile the trans subproject that it runs into trouble.
Here's a partial stack trace. It means nothing to me except that Coursier is involved.
sbt:toysat> last trans / update
[debug] not up to date. inChanged = true, force = false
[debug] Updating trans...
[warn]
[warn] Note: Unresolved dependencies path:
[error] sbt.librarymanagement.ResolveException: Error downloading base:base_2.12:0.1.0-SNAPSHOT
[error] Not found
[error] Not found
[error] not found: /Users/drewmcdermott/.ivy2/local/base/base_2.12/0.1.0-SNAPSHOT/ivys/ivy.xml
[error] not found: https://repo1.maven.org/maven2/base/base_2.12/0.1.0-SNAPSHOT/base_2.12-0.1.0-SNAPSHOT.pom
[error] at lmcoursier.CoursierDependencyResolution.unresolvedWarningOrThrow(CoursierDependencyResolution.scala:258)
[error] at lmcoursier.CoursierDependencyResolution.$anonfun$update$38(CoursierDependencyResolution.scala:227)
[error] at lmcoursier.CoursierDependencyResolution$$Lambda$4262/0x0000000000000000.apply(Unknown Source)
[error] at scala.util.Either$LeftProjection.map(Either.scala:573)
[error] at lmcoursier.CoursierDependencyResolution.update(CoursierDependencyResolution.scala:227)
[error] at sbt.librarymanagement.DependencyResolution.update(DependencyResolution.scala:60)
[error] at sbt.internal.LibraryManagement$.resolve$1(LibraryManagement.scala:59)
It seems clear that some cache somewhere is holding onto the string 0.1.0-SNAPSHOT, but there are an ungodly number of caches. I've tried deleting several, but I haven't found the relevant one.
Can someone explain how to recover from a situation like this?
Your base project is only compiled for Scala 2.13, whereas it is declared (via dependsOn) as a dependency of trans, which targets Scala 3.
You should cross-build your base project for Scala 2.13 and 3 (and maybe 2.12, judging by your error message, even though I don't see any use of Scala 2.12 in what you shared).
Edit: Scala 2.13 and 3 are compatible, so the issue should only happen if a dependency is built only for 2.12.
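A minimal sketch of what cross-building base could look like with plain crossScalaVersions (for inter-project dependencies across different Scala versions, the sbt-projectmatrix plugin the asker settles on below is often the more practical route):
lazy val base = (project in file("base"))
  .settings(
    name := "base",
    scalaVersion := scala213,
    // compile and publish this module for both Scala versions in the
    // build, so trans (Scala 3) can resolve a matching base artifact
    crossScalaVersions := Seq(scala213, scala3),
    libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.7" % Test
  )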
I am not answering my own question because I'm a narcissist, but because I can't say what I want in a comment. Plus, editing the original question would bury possibly useful information in an odd place. I've upvoted and accepted GaelJ's answer.
My build.sbt doesn't look that different; the key differences show up in the revised trans subproject definition:
lazy val trans = (projectMatrix in file("trans"))
  .settings(
    name := "trans",
    version := "0.3",
    // I thought this line was unnecessary, but without it
    // sbt doesn't understand the command trans / compile
    Compile / scalaVersion := scala3,
    libraryDependencies += "org.scalatest" %% "scalatest" % "3.2.7" % Test
  )
  .jvmPlatform(scalaVersions = Seq(scala213))
  .dependsOn(base)
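Note that projectMatrix is not built into sbt: it comes from the sbt-projectmatrix plugin, so the revised build presumably also carries a line like this in project/plugins.sbt (an assumption; the version shown is illustrative):
addSbtPlugin("com.eed3si9n" % "sbt-projectmatrix" % "0.8.0")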

Scala IntelliJ library import errors

I am new to Scala and I am trying to import the following libraries in my build.sbt. When IntelliJ does an auto-update, I get the following error:
Error while importing sbt project:
List([info] welcome to sbt 1.3.13 (Oracle Corporation Java 1.8.0_251)
[info] loading global plugins from C:\Users\diego\.sbt\1.0\plugins
[info] loading project definition from C:\Users\diego\development\Meetup\Stream-Processing\project
[info] loading settings for project stream-processing from build.sbt ...
[info] set current project to Stream-Processing (in build file:/C:/Users/diego/development/Meetup/Stream-Processing/)
[info] sbt server started at local:sbt-server-80d70f9339b81b4d026a
sbt:Stream-Processing>
[info] Defining Global / sbtStructureOptions, Global / sbtStructureOutputFile and 1 others.
[info] The new values will be used by cleanKeepGlobs
[info] Run `last` for details.
[info] Reapplying settings...
[info] set current project to Stream-Processing (in build file:/C:/Users/diego/development/Meetup/Stream-Processing/)
[info] Applying State transformations org.jetbrains.sbt.CreateTasks from C:/Users/diego/.IntelliJIdea2019.3/config/plugins/Scala/repo/org.jetbrains/sbt-structure-extractor/scala_2.12/sbt_1.0/2018.2.1+4-88400d3f/jars/sbt-structure-extractor.jar
[info] Reapplying settings...
[info] set current project to Stream-Processing (in build file:/C:/Users/diego/development/Meetup/Stream-Processing/)
[warn]
[warn] Note: Unresolved dependencies path:
[error] stack trace is suppressed; run 'last update' for the full output
[error] stack trace is suppressed; run 'last ssExtractDependencies' for the full output
[error] (update) sbt.librarymanagement.ResolveException: Error downloading org.apache.kafka:kafka-clients_2.11:2.3.1
[error] Not found
[error] Not found
[error] not found: C:\Users\diego\.ivy2\local\org.apache.kafka\kafka-clients_2.11\2.3.1\ivys\ivy.xml
[error] not found: https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients_2.11/2.3.1/kafka-clients_2.11-2.3.1.pom
[error] (ssExtractDependencies) sbt.librarymanagement.ResolveException: Error downloading org.apache.kafka:kafka-clients_2.11:2.3.1
[error] Not found
[error] Not found
[error] not found: C:\Users\diego\.ivy2\local\org.apache.kafka\kafka-clients_2.11\2.3.1\ivys\ivy.xml
[error] not found: https://repo1.maven.org/maven2/org/apache/kafka/kafka-clients_2.11/2.3.1/kafka-clients_2.11-2.3.1.pom
[error] Total time: 2 s, completed Jun 28, 2020 12:11:24 PM
[info] shutting down sbt server)
This is my build.sbt file:
name := "Stream-Processing"
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10_2.12
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.4.4"
// https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
libraryDependencies += "org.apache.kafka" %% "kafka-clients" % "2.3.1"
// https://mvnrepository.com/artifact/mysql/mysql-connector-java
libraryDependencies += "mysql" % "mysql-connector-java" % "8.0.18"
// https://mvnrepository.com/artifact/org.mongodb.spark/mongo-spark-connector
libraryDependencies += "org.mongodb.spark" %% "mongo-spark-connector" % "2.4.1"
I made a plain Scala project just to make sure Spark works, and my Python project using Kafka works as well, so I am sure it's not a Spark/Kafka problem. Any idea why I am getting this error?
Try removing one % before "kafka-clients":
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.3.1"
The semantics of %% in sbt are that it appends the Scala version being used to the artifact name, so the dependency becomes org.apache.kafka:kafka-clients_2.11:2.3.1, as the error message shows. Note the _2.11 suffix.
This is a nice shorthand for Scala libraries, but it can get confusing for beginners when used with Java libraries.
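To make the difference concrete, with scalaVersion := "2.11.8" the two operators expand like this (a sketch of the equivalences, not lines to add to the build):
// %% appends the Scala binary version to the artifact name:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.4"
// ... resolves org.apache.spark:spark-sql_2.11:2.4.4
// % uses the artifact name verbatim -- right for Java libraries:
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "2.3.1"
// ... resolves org.apache.kafka:kafka-clients:2.3.1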

Scala REPL import SBT file

I have an SBT file that has the following contents:
name := "Scala Playground"
version := "1.0"
scalaVersion := "2.11.6"
resolvers += "Typesafe Repo" at "http://repo.typesafe.com/typesafe/releases/"
libraryDependencies ++= Seq(
  "com.netflix.rxjava" %% "rxjava-scala" % "0.19.1",
  "com.typesafe.play" %% "play-json" % "2.2.1"
)
Saved as scala-playground.sbt. I want to use this in my Scala REPL. When I tried to do the following:
sbt scala-playground.sbt
I got the following error:
[info] Set current project to Scala Playground (in build file:/home/joe/Desktop/)
[error] Not a valid command: scala-playground
[error] Not a valid project ID: scala-playground
[error] Expected ':' (if selecting a configuration)
[error] Not a valid key: scala-playground (similar: scala-version, scalac-options, scala-binary-version)
[error] scala-playground
[error] ^
I can't see anything stupid in my sbt file. Could anyone throw some light on it? Is this a proper way to get dependencies into my Scala REPL?
All I want is to pull a few dependencies into my Scala REPL so that I can quickly run and evaluate certain libraries.
The command line arguments to sbt are commands, not the file you want to use. Just go to the directory containing the scala-playground.sbt file and run from there:
sbt console
sbt automatically loads the scala-playground.sbt file from the current directory and opens a Scala console.
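Once the console is up, the declared dependencies are on the classpath and can be exercised directly; a hypothetical session with the play-json dependency from the build might look like:
scala> import play.api.libs.json._
import play.api.libs.json._
scala> Json.parse("""{"name": "joe", "age": 42}""")
res0: play.api.libs.json.JsValue = {"name":"joe","age":42}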