Writing a file to Amazon S3 - scala

I am trying the following:
import awscala._, s3._
implicit val s3 = S3()
val bucket = s3.createBucket("acme-datascience-lab")
bucket.put("sample.txt", new java.io.File("sample.txt"))
I get the following error:
Exception in thread "main" java.lang.NoSuchFieldError: EU_CENTRAL_1
at awscala.Region0$.<init>(Region0.scala:27)
at awscala.Region0$.<clinit>(Region0.scala)
at awscala.package$.<init>(package.scala:3)
at awscala.package$.<clinit>(package.scala)
at awscala.s3.S3$.apply$default$2(S3.scala:18)
at com.acme.spark.FlightDelays.HistoricalFlightDelayOutput$.main(HistoricalFlightDelayOutput.scala:164)
at com.acme.spark.FlightDelays.HistoricalFlightDelayOutput.main(HistoricalFlightDelayOutput.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
It occurs on this line of code:
implicit val s3 = S3()
Here is the contents of my build.sbt file:
import AssemblyKeys._
assemblySettings
name := "acme-get-flight-delays"
version := "0.0.1"
scalaVersion := "2.10.5"
// additional libraries
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "1.6.0" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.0",
"org.apache.spark" %% "spark-hive" % "1.6.0",
"org.scalanlp" %% "breeze" % "0.11.2",
"org.scalanlp" %% "breeze-natives" % "0.11.2",
"net.liftweb" %% "lift-json" % "2.5+",
"org.apache.hadoop" % "hadoop-client" % "2.6.0",
"org.apache.hadoop" % "hadoop-aws" % "2.6.0",
"com.amazonaws" % "aws-java-sdk" % "1.0.002",
"com.github.seratch" %% "awscala" % "0.5.+"
)
resolvers ++= Seq(
"Spark Packages Repo" at "http://dl.bintray.com/spark-packages/maven",
"JBoss Repository" at "http://repository.jboss.org/nexus/content/repositories/releases/",
"Spray Repository" at "http://repo.spray.cc/",
"Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/",
"Akka Repository" at "http://repo.akka.io/releases/",
"Twitter4J Repository" at "http://twitter4j.org/maven2/",
"Apache HBase" at "https://repository.apache.org/content/repositories/releases",
"Twitter Maven Repo" at "http://maven.twttr.com/",
"scala-tools" at "https://oss.sonatype.org/content/groups/scala-tools",
"Typesafe repository" at "http://repo.typesafe.com/typesafe/releases/",
"Second Typesafe repo" at "http://repo.typesafe.com/typesafe/maven-releases/",
"Mesosphere Public Repository" at "http://downloads.mesosphere.io/maven",
Resolver.sonatypeRepo("public")
)
mergeStrategy in assembly <<= (mergeStrategy in assembly) { (old) =>
{
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.startsWith("META-INF") => MergeStrategy.discard
case PathList("javax", "servlet", xs # _*) => MergeStrategy.first
case PathList("org", "apache", xs # _*) => MergeStrategy.first
case PathList("org", "jboss", xs # _*) => MergeStrategy.first
case "about.html" => MergeStrategy.rename
case "reference.conf" => MergeStrategy.concat
case _ => MergeStrategy.first
}
}
// Configure JAR used with the assembly plug-in
jarName in assembly := "acme-get-flight-delays.jar"
// A special option to exclude Scala itself from our assembly JAR, since Spark
// already bundles in Scala.
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)

You can use the AWS SDK for Java directly or a wrapper like AWScala that lets you do something like
import awscala._, s3._
implicit val s3 = S3()
val bucket = s3.bucket("your-bucket")
bucket.put("sample.txt", new java.io.File("sample.txt"))

Add the following to build.sbt:
libraryDependencies += "com.github.seratch" %% "awscala" % "0.3.+"
The following code will do the job:
implicit val s3 = S3()
s3.setRegion(com.amazonaws.regions.Region.getRegion(com.amazonaws.regions.Regions.US_EAST_1))
val bucket = s3.bucket("acme-datascience-lab")
bucket.get.put("sample.txt", new java.io.File("sample.txt"))

Related

Can't shade jars by shade plugin of SBT

I have many Deduplicate found... error when build project with SBT :
[error] Deduplicate found different file contents in the following:
[error] Jar name = netty-all-4.1.68.Final.jar, jar org = io.netty, entry target = io/netty/handler/ssl/SslProvider.class
[error] Jar name = netty-handler-4.1.50.Final.jar, jar org = io.netty, entry target = io/netty/handler/ssl/SslProvider.class
...
For now I consider the option with shading all libraries (as here):
libraryDependencies ++= Seq(
"com.rometools" % "rome" % "1.18.0",
"com.typesafe.scala-logging" %% "scala-logging" % "3.9.5", // log
"ch.qos.logback" % "logback-classic" % "1.4.5", // log
"com.lihaoyi" %% "upickle" % "1.6.0", // file-io
"net.liftweb" %% "lift-json" % "3.5.0", // json
"org.apache.spark" %% "spark-sql" % "3.2.2", // spark
"org.apache.spark" %% "spark-core" % "3.2.2" % "provided", // spark
"org.postgresql" % "postgresql" % "42.5.1", // spark + postgresql
)
So that I added the following shade-rules:
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.lihaoyi.**" -> "crdaa.#1")
.inLibrary("com.lihaoyi" %% "upickle" % "1.6.0")
.inProject,
ShadeRule.rename("ch.qos.logback.**" -> "crdbb.#1")
.inLibrary("ch.qos.logback" % "logback-classic" % "1.4.5")
.inProject,
ShadeRule.rename("com.typesafe.**" -> "crdcc.#1")
.inLibrary("com.typesafe.scala-logging" %% "scala-logging" % "3.9.5")
.inProject,
ShadeRule.rename("org.apache.spark.spark-sql.**" -> "crddd.#1")
.inLibrary("org.apache.spark" %% "spark-sql" % "3.2.2")
.inProject,
ShadeRule.rename("org.apache.spark.spark-core.**" -> "crdee.#1")
.inLibrary("org.apache.spark" %% "spark-core" % "3.2.2")
.inProject,
ShadeRule.rename("com.rometools.**" -> "crdff.#1")
.inLibrary("com.rometools" % "rome" % "1.18.0")
.inProject,
ShadeRule.rename("org.postgresql.postgresql.**" -> "crdgg.#1")
.inLibrary("org.postgresql" % "postgresql" % "42.5.1")
.inProject,
ShadeRule.rename("net.liftweb.**" -> "crdhh.#1")
.inLibrary("net.liftweb" %% "lift-json" % "3.5.0")
.inProject,
)
But after reloading SBT when I start assembly I got the same errors with duplicates.
What can be problem here?
PS:
ThisBuild / scalaVersion := "2.13.10"
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "2.1.0")
Update
Finally I ditched the rename in favor of unmanagedJars + not including spark dependencies (most of the errors were caused by them) by setting provided option .
After that only Deduplicate-errors with module-info.class remains, but its solution (by changing merging strategy) is described in sbt-assembly-doc.
That is, I downloaded spark separately, copied their jars into ./jarlib directory (!!! not in ./lib directory), changed the following in build conf:
libraryDependencies ++= Seq(
//...
"org.apache.spark" %% "spark-sql" % "3.2.3" % "provided",
"org.apache.spark" %% "spark-core" % "3.2.3" % "provided",
)
unmanagedJars in Compile += file("./jarlib")
ThisBuild / assemblyMergeStrategy := {
case PathList("module-info.class") => MergeStrategy.discard
case x if x.endsWith("/module-info.class") => MergeStrategy.discard
case x =>
val oldStrategy = (ThisBuild / assemblyMergeStrategy).value
oldStrategy(x)
}
Spark-jars have been included in final jar
Update 2
As noted in comments unmanagedJars are useless in that case - so I removed unmanagedJars string from build.sbt
Noted Spark-jars which aren't included in final jar-file should be in class-path when you start jar.
In my case I copied Spark-jars + final jar to folder ./app and start jar by:
java -cp "./app/*" main.Main
... where main.Main is main-class.
Sometimes like this (put in your build.sbt) is how you typically remove the deduplication that comes when your libraries have overlapping libraries of their own:
assemblyMergeStrategy in assembly := {
case PathList("javax", "activation", _*) => MergeStrategy.first
case PathList("com", "sun", _*) => MergeStrategy.first
case "META-INF/io.netty.versions.properties" => MergeStrategy.first
case "META-INF/mime.types" => MergeStrategy.first
case "META-INF/mailcap.default" => MergeStrategy.first
case "META-INF/mimetypes.default" => MergeStrategy.first
case d if d.endsWith(".jar:module-info.class") => MergeStrategy.first
case d if d.endsWith("module-info.class") => MergeStrategy.first
case d if d.endsWith("/MatchersBinder.class") => MergeStrategy.discard
case d if d.endsWith("/ArgumentsProcessor.class") => MergeStrategy.discard
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}

ScalaSpark: unable to create dataframe with scala-client dependency

I need to support couchbase version 6 with spark 2.3 or 2.4 and scala version is 2.11.12. I am facing an issue while creating a data frame.
SBT code snippet
scalaVersion := "2.11.12"
resolvers += "Couchbase Snapshots" at "http://files.couchbase.com/maven2"
val sparkVersion = "2.3.2"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" %% "spark-streaming" % sparkVersion,
"org.apache.spark" %% "spark-sql" % sparkVersion,
"com.couchbase.client" %% "spark-connector" % "2.3.0",
"com.couchbase.client" %% "scala-client" % "1.0.0-alpha.3")
Code
val spark = SparkSession
.builder()
.appName("Example")
.master("local[*]")
.config("spark.couchbase.nodes", "10.12.12.88") // connect to Couchbase Server on localhost
.config("spark.couchbase.username", "abcd") // with given credentials
.config("spark.couchbase.password", "abcd")
.config("spark.couchbase.bucket.beer-sample", "") // open the travel-sample bucket
.getOrCreate()
val sc = spark.sparkContext
import com.couchbase.spark.sql._
val sql = spark.sqlContext
val dataframe = sql.read.couchbase()
val result = dataframe.collect()
Exception
Caused by: java.lang.ClassNotFoundException: com.couchbase.client.core.message.CouchbaseRequest
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
Tried:
As per the suggestion added a dependency
"com.couchbase.client" % "core-io" % "1.7.6",
Without scala-client dependency I am able to get dataframe but with scala-client unable to fix. please suggest a solution for this problem
I have made the changes to your build.sbt file and have added settings for the sbt-assembly plugin.
scalaVersion := "2.11.12"
resolvers += "Couchbase Snapshots" at "http://files.couchbase.com/maven2"
val sparkVersion = "2.3.2"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % Provided,
"org.apache.spark" %% "spark-streaming" % sparkVersion % Provided,
"org.apache.spark" %% "spark-sql" % sparkVersion % Provided,
"com.couchbase.client" %% "spark-connector" % "2.3.0")
assemblyJarName in assembly := s"${name.value}-${version.value}.jar"
assemblyMergeStrategy in assembly := {
case m if m.toLowerCase.endsWith("manifest.mf") => MergeStrategy.discard
case m if m.toLowerCase.matches("meta-inf.*\\.sf$") => MergeStrategy.discard
case "reference.conf" => MergeStrategy.concat
case x: String if x.contains("UnusedStubClass.class") => MergeStrategy.first
case _ => MergeStrategy.first
}
You need to create a file called plugins.sbt in the directory called project and add the following line to it:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")
Once done, run the command sbt clean compile assembly in the projects' root directory. it should build your jar.

Scala Spark Streaming unit test with spark-testing-base throws error

I was trying to run a unit test on my spark streaming code with spark-testing-base. And I am having trouble running their sample-codes.
Here is the code snippet I copied
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite
class SampleTest extends FunSuite with SharedSparkContext {
test("test initializing spark context") {
val list = List(1, 2, 3, 4)
val rdd = sc.parallelize(list)
assert(rdd.count === list.length)
}
}
And here is the stacktrace of errors.
18/10/19 02:08:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
An exception or error caused a run to abort.
java.lang.ExceptionInInitializerError
at org.apache.spark.SparkContext.withScope(SparkContext.scala:701)
at org.apache.spark.SparkContext.parallelize(SparkContext.scala:718)
at com.myproject.SampleTest$$anonfun$1.apply(DStreamTransformSpec.scala:11)
at com.myproject.analytic.SampleTest$$anonfun$1.apply(DStreamTransformSpec.scala:9)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
I am including my build.sbt not sure if this help.
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.2.0"
libraryDependencies += "org.apache.bahir" %% "spark-streaming-pubsub" % "2.2.0"
libraryDependencies += "com.typesafe.play" %% "play-json" % "2.6.7"
libraryDependencies += "com.google.cloud" % "google-cloud-datastore" % "1.40.0"
// For test
libraryDependencies += "org.scalactic" %% "scalactic" % "3.0.5"
libraryDependencies += "org.scalatest" %% "scalatest" % "3.0.5" % "test"
libraryDependencies += "com.holdenkarau" %% "spark-testing-base" % "2.2.0_0.10.0" % "test"
fork in Test := true
javaOptions ++= Seq("-Xms512M", "-Xmx2048M", "-XX:MaxPermSize=2048M", "-XX:+CMSClassUnloadingEnabled")
assemblyMergeStrategy in assembly := {
case PathList("META-INF", xs # _*) => MergeStrategy.discard
case x => MergeStrategy.first
}
assemblyShadeRules in assembly := Seq(
ShadeRule.rename("com.google.common.**" -> "repackaged.com.google.common.#1").inAll,
ShadeRule.rename("com.google.protobuf.**" -> "repackaged.com.google.protobuf.#1").inAll
)
And is there any other tools, that are recommended to do testing with DStream?
Possibly you might have forgotten? I don't see it in build.sbt
parallelExecution in Test := false

SBT Assembly - Deduplicate

I got the following SBT files:
.
-- root
-- plugins.sbt
-- build.sbt
With plugins.sbt containing the following:
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.12.0")
And build.sbt containing the following:
import sbt.Keys._
resolvers in ThisBuild ++= Seq("Apache Development Snapshot Repository" at "https://repository.apache.org/content/repositories/snapshots/", Resolver.sonatypeRepo("public"))
name := "flink-experiment"
lazy val commonSettings = Seq(
organization := "my.organisation",
version := "0.1.0-SNAPSHOT"
)
val flinkVersion = "1.1.0"
val sparkVersion = "2.0.0"
val kafkaVersion = "0.8.2.1"
val hadoopDependencies = Seq(
"org.apache.avro" % "avro" % "1.7.7" % "provided",
"org.apache.avro" % "avro-mapred" % "1.7.7" % "provided"
)
val flinkDependencies = Seq(
"org.apache.flink" %% "flink-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-streaming-scala" % flinkVersion % "provided",
"org.apache.flink" %% "flink-connector-kafka-0.8" % flinkVersion exclude("org.apache.kafka", "kafka_${scala.binary.version}")
)
val sparkDependencies = Seq(
"org.apache.spark" %% "spark-core" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming" % sparkVersion % "provided",
"org.apache.spark" %% "spark-streaming-kafka-0-8" % sparkVersion exclude("org.apache.kafka", "kafka_${scala.binary.version}")
)
val kafkaDependencies = Seq(
"org.apache.kafka" %% "kafka" % "0.8.2.1"
)
val toolDependencies = Seq(
"com.github.scopt" %% "scopt" % "3.5.0"
)
val testDependencies = Seq(
"org.scalactic" %% "scalactic" % "2.2.6",
"org.scalatest" %% "scalatest" % "2.2.6" % "test"
)
lazy val root = (project in file(".")).
settings(commonSettings: _*).
settings(
libraryDependencies ++= hadoopDependencies,
libraryDependencies ++= flinkDependencies,
libraryDependencies ++= sparkDependencies,
libraryDependencies ++= kafkaDependencies,
libraryDependencies ++= toolDependencies,
libraryDependencies ++= testDependencies
).
enablePlugins(AssemblyPlugin)
run in Compile <<= Defaults.runTask(fullClasspath in Compile, mainClass in(Compile, run), runner in(Compile, run))
mainClass in assembly := Some("my.organization.experiment.Experiment")
assemblyOption in assembly := (assemblyOption in assembly).value.copy(includeScala = false)
Now sbt clean assembly sadly gives the following exception:
[error] (root/*:assembly) deduplicate: different file contents found in the following:
[error] /home/kevin/.ivy2/cache/org.apache.spark/spark-streaming-kafka-0-8_2.10/jars/spark-streaming-kafka-0-8_2.10-2.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] /home/kevin/.ivy2/cache/org.apache.spark/spark-tags_2.10/jars/spark-tags_2.10-2.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
[error] /home/kevin/.ivy2/cache/org.spark-project.spark/unused/jars/unused-1.0.0.jar:org/apache/spark/unused/UnusedStubClass.class
How can I fix this?
https://github.com/sbt/sbt-assembly#excluding-jars-and-files
you can define assemblyMergeStrategy and probably discard ony file that you listed as they are all in 'unused' package.
You can override the default strategy for conflicts:
val defaultMergeStrategy: String => MergeStrategy = {
case x if Assembly.isConfigFile(x) =>
MergeStrategy.concat
case PathList(ps # _*) if Assembly.isReadme(ps.last) || Assembly.isLicenseFile(ps.last) =>
MergeStrategy.rename
case PathList("META-INF", xs # _*) =>
(xs map {_.toLowerCase}) match {
case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) =>
MergeStrategy.discard
case ps # (x :: xs) if ps.last.endsWith(".sf") || ps.last.endsWith(".dsa") =>
MergeStrategy.discard
case "plexus" :: xs =>
MergeStrategy.discard
case "services" :: xs =>
MergeStrategy.filterDistinctLines
case ("spring.schemas" :: Nil) | ("spring.handlers" :: Nil) =>
MergeStrategy.filterDistinctLines
case _ => MergeStrategy.deduplicate
}
case _ => MergeStrategy.deduplicate
}
as you can see assembly default strategy is MergeStrategy.deduplicate, you can add a new case case UnusedStubClass => MergeStrategy.first

scala.MatchError when creating fat jar using sbt assembly

I am trying to create a jar file for my project. I am using sbt assembly command to generate one.
But getting error when it starts merging files:
scala.MatchError:
spray\http\parser\ProtocolParameterRules$$anonfun$DeltaSeconds$1.class
(of class java.lang.String)
My build.sbt looks like this:
lazy val commonSettings = Seq(
name := "SampleSpray",
version := "1.0",
scalaVersion := "2.11.7",
organization := "com.test"
)
mainClass in assembly := Some("com.example.Boot")
lazy val root = (project in file(".")).
settings(commonSettings: _*).
settings(
name := "test",
resolvers += "spray repo" at "http://repo.spray.io",
libraryDependencies ++= {
val akkaV = "2.3.9"
val sprayV = "1.3.3"
Seq(
"io.spray" %% "spray-can" % sprayV,
"io.spray" %% "spray-routing" % sprayV,
"io.spray" %% "spray-json" % "1.3.2",
"io.spray" %% "spray-testkit" % sprayV % "test",
"com.typesafe.akka" %% "akka-actor" % akkaV,
"com.typesafe.akka" %% "akka-testkit" % akkaV % "test",
"org.specs2" %% "specs2-core" % "2.3.11" % "test",
"com.sksamuel.elastic4s" %% "elastic4s-core" % "2.1.0",
"com.sksamuel.elastic4s" %% "elastic4s-jackson" % "2.1.0",
"net.liftweb" %% "lift-json" % "2.6+"
)
}
)
assemblyOption in assembly := (assemblyOption in assembly).value.copy(cacheUnzip = false)
assemblyMergeStrategy in assembly := {
case "BaseDateTime.class" => MergeStrategy.first
}
Don't know why the error is coming.
The setting assemblyMergeStrategy in assembly has the type String => MergeStrategy.
In your sbt file you are using the partial function
{
case "BaseDateTime.class" => MergeStrategy.first
}
which is syntactic sugar for
(s:String) => {
s match {
case "BaseDateTime.class" => MergeStrategy.first
}
}
This representation shows that the given function will not exhaustively match all passed strings. In your case sbt-assembly tried to merge the file named spray\http\parser\ProtocolParameterRules$$anonfun$DeltaSeconds$1.class into the fat jar, but could not find any matching merge strategy. You need a "default" case also:
(s:String) => {
s match {
case "BaseDateTime.class" => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
}
Or written as partial function:
{
case "BaseDateTime.class" => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}
I also ran into the same issue when sbt-assembly's assembly task failed to create a fat jar due to a name conflict in the elasticsearch and its transitive joda-time dependencies. Elasticsearch redefines the class org.joda.time.base.BaseDateTime which is already implemented in the joda-time library. I've followed your approach to tell sbt-assembly how resolve this conflict using the following assemblyMergeStrategy:
assemblyMergeStrategy in assembly := {
case "org/joda/time/base/BaseDateTime.class" => MergeStrategy.first
case x =>
val oldStrategy = (assemblyMergeStrategy in assembly).value
oldStrategy(x)
}