I have a sample codebase in Scala where I use OpenCV and ScalaPy to do some image classification. Here is the code snippet:
def loadImage(imagePath: String): Image = {
  // 0. Load the image and extract the class label, where the path to the image is assumed to be
  //    /path/to/dataset/{class}/{image}.jpg
  val matrix: Mat = imread(imagePath)
  val label = imagePath.split("/").takeRight(2).head // the class name is the parent directory of the image
  // 1. Run the loaded image through the preprocessors, resulting in a feature vector
  //val preProcessedImagesWithLabels = Seq(new ImageResizePreProcessor(appCfg)).map(preProcessor => (preProcessor.preProcess(matrix), label))
  val preProcessedImagesWithLabels = Seq(new ImageResizePreProcessor(appCfg)).map(preProcessor => preProcessor.preProcess(matrix))
  np.asarray(preProcessedImagesWithLabels)
}
However, it fails because it cannot find the NumPy type (and hence the implicit converter):
[error] /home/joesan/Projects/Private/ml-projects/object-classifier/src/main/scala/com/bigelectrons/animalclassifier/ImageLoader.scala:10:34: not found: type NumPy
[error] val np = py.module("numpy").as[NumPy]
What else is expected in addition to the imports and these dependencies?
"me.shadaj" %% "scalapy-numpy" % "0.1.0",
"me.shadaj" %% "scalapy-core" % "0.5.0",
Try the latest "dev" version of scalapy-numpy: 0.1.0+6-14ca0424.
So change the dependency in your sbt build to:
libraryDependencies += "me.shadaj" %% "scalapy-numpy" % "0.1.0+6-14ca0424"
I tried this script in Ammonite:
import me.shadaj.scalapy.numpy.NumPy
import me.shadaj.scalapy.py
val np = py.module("numpy").as[NumPy]
And it seems to find the NumPy type as expected.
I find the following simple example hangs indefinitely for me in the Scala REPL (sbt console):
import org.apache.spark.sql._
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 100000000)
val n = rdd.map(_ + 1).sum
However, the following works just fine:
import org.apache.spark.sql._
val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext
val rdd1 = sc.parallelize(1 to 100000000)
val rdd2 = rdd1.map(_ + 1)
val n = rdd2.sum
I'm very confused by this, and was hoping somebody had an explanation... assuming they can reproduce the 'issue'.
This is basically just the example provided on the Almond kernel's Spark documentation page, and it does work just fine in Jupyter using the Almond kernel. Also, sbt "runMain Main" works just fine for the following:
import org.apache.spark.sql._
object Main extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  val sc = spark.sparkContext
  val rdd = sc.parallelize(1 to 100000000)
  val n = rdd.map(_ + 1).sum
  println(s"\n\nn: $n\n\n")
  spark.stop
}
For completeness, I'm using a very simple build.sbt file, which looks as follows:
name := """sparktest"""
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.6"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.6"
I noticed a bunch of errors like the following when I killed the console:
08:53:36 ERROR Executor:70 - Exception in task 2.0 in stage 0.0 (TID 2): Could not initialize class $line3.$read$$iw$$iw$$iw$$iw$
This led me to:
Lambda in REPL (using object-wrappers) + concurrency = deadlock #9076
It seems that my problem is this same issue and that it is specific to Scala 2.12. Adding the following line to build.sbt seems to be the accepted workaround:
scalacOptions += "-Ydelambdafy:inline"
I've been using cats.data.Validated successfully to solve the problem below, but I have run into a problem using my existing solution for a case class with more than 22 members (because the constructor cannot be turned into a Function).
Here is my goal: generate a bunch of ValidatedNel[E, T], sequence them into a ValidatedNel[E, (T1, T2, ...)], then mapN(DAOClass) (where DAOClass is a case class with the specified arguments). This works for fewer than 22 arguments, but fails with more because of two problems:
(T1, T2, ...) cannot have more than 22 components
DAOClass.apply cannot be transformed into a Function
So I am looking into using shapeless.HList to handle part 1 and having problems. I should be able to use Generic[DAOClass] to satisfactorily handle part 2 when I get to it, or if that doesn't work, use extensible records with a bit more boilerplate.
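For reference, this is the tuple-based route that works below the 22-field limit, shown as a minimal sketch with the same two-field DAOClass (it needs -Ypartial-unification enabled, so the real problem does not show up here):
import cats.data.ValidatedNel
import cats.implicits._

case class DAOClass(a: Int, b: Int)

// Combine the individual validations through the case class constructor.
val generated: ValidatedNel[String, DAOClass] =
  (1.validNel[String], 2.validNel[String]).mapN(DAOClass.apply)
// generated == Valid(DAOClass(1,2))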
Here is some small example code (not with 22 components):
package example
import cats.syntax.validated._
import cats.data.ValidatedNel
import cats.sequence._
import shapeless._
case class DAOClass(a: Int, b: Int)
object DAOClass {
  def generate: ValidatedNel[String, DAOClass] = {
    val hlist: ValidatedNel[String, Int] :: ValidatedNel[String, Int] :: HNil =
      1.validNel :: 2.validNel :: HNil
    val hlistSequence: ValidatedNel[String, Int :: Int :: HNil] = hlist.sequence
    hlistSequence.map(Generic[DAOClass].from)
  }
}
This uses the kittens library to sequence the HList.
Unfortunately, this gives me a compile error:
[error] ...src/main/scala/example/DAOClass.scala:17:73: cannot construct sequencer, make sure that every item of your hlist shapeless.:: [cats.data.ValidatedNel[String,Int],shapeless.::[cats.data.ValidatedNel[String,Int],shapeless.HNil]] is an Apply
[error] val hlistSequence: ValidatedNel[String, ::[Int, ::[Int, HNil]]] = hlist.sequence
[error] ^
I have extracted this into a test project; here's my build.sbt:
lazy val root = (project in file(".")).
  settings(
    inThisBuild(List(
      organization := "com.example",
      scalaVersion := "2.12.6",
      version := "0.1.0-SNAPSHOT"
    )),
    name := "shapeless-validation",
    resolvers ++= Seq(
      Resolver.sonatypeRepo("releases")
    ),
    libraryDependencies ++= Seq(
      "com.chuusai" %% "shapeless" % "2.3.3",
      "org.scalatest" %% "scalatest" % "3.0.5" % "test",
      "org.typelevel" %% "cats-core" % "1.1.0",
      "org.typelevel" %% "kittens" % "1.1.0"
    )
  )
What am I missing? Do I need to import more implicits somewhere? Is there a better way to do this?
You forgot to add
scalacOptions += "-Ypartial-unification"
to build.sbt. For normal work with cats this is usually mandatory.
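In the build above, the flag goes alongside the other project settings, for example:
lazy val root = (project in file(".")).
  settings(
    // ...the existing settings from the build.sbt above...
    scalacOptions += "-Ypartial-unification"
  )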
Now
hlistSequence.map(Generic[DAOClass].from)
produces a ValidatedNel[String, DAOClass]:
println(DAOClass.generate) // Valid(DAOClass(1,2))
I use Spark 2.3.0.
The following code fragment works fine in spark-shell:
def transform(df: DataFrame): DataFrame = {
  df.select(
    explode($"person").alias("p"),
    $"history".alias("h"),
    $"company_id".alias("id")
  )
}
Yet when editing within IntelliJ, it will not recognize the select, explode and $ functions. These are my dependencies in sbt:
version := "1.0"
scalaVersion := "2.11.8"
libraryDependencies ++= {
val sparkVer = "2.1.0"
Seq(
"org.apache.spark" %% "spark-core" % sparkVer % "provided" withSources(),
"org.apache.spark" %% "spark-sql" % sparkVer % "provided" withSources()
)
}
Is there anything missing? An import statement, or an additional library?
You should use the following import in the transform method (to have explode available):
import org.apache.spark.sql.functions._
You could also do the following to be precise about what you import:
import org.apache.spark.sql.functions.explode
It works in spark-shell since it does the import by default (so you don't have to worry about such simple things :)).
scala> spark.version
res0: String = 2.3.0
scala> :imports
1) import org.apache.spark.SparkContext._ (69 terms, 1 are implicit)
2) import spark.implicits._ (1 types, 67 terms, 37 are implicit)
3) import spark.sql (1 terms)
4) import org.apache.spark.sql.functions._ (354 terms)
As for $, it is also imported by default in spark-shell for your convenience. Add the following to have it available in your method:
import spark.implicits._
Depending on where the transform method is defined, you may add an implicit parameter to it as follows (and skip the import above):
def transform(df: DataFrame)(implicit spark: SparkSession): DataFrame = {
...
}
I'd however prefer using the SparkSession bound to the input DataFrame (which seems cleaner and... geekier :)).
def transform(df: DataFrame): DataFrame = {
import df.sparkSession.implicits._
...
}
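Putting the pieces together, a self-contained version of the method from the question (same column names, using the DataFrame-bound session for $) could look like this:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.explode

def transform(df: DataFrame): DataFrame = {
  import df.sparkSession.implicits._ // brings $ and the other implicits into scope
  df.select(
    explode($"person").alias("p"),
    $"history".alias("h"),
    $"company_id".alias("id")
  )
}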
As a bonus, I'd also cleanup your build.sbt so it would look as follows:
libraryDependencies += "org.apache.spark" %% "spark-sql" % 2.1.0" % "provided" withSources()
You won't need spark-core as an explicit dependency in your Spark SQL applications (it is a transitive dependency of spark-sql).
IntelliJ does not have spark.implicits._ available automatically, which is why explode throws an error. Remember to create the SparkSession via SparkSession.builder() before importing.
The following code works:
val spark = SparkSession.builder()
.master("local")
.appName("ReadDataFromTextFile")
.getOrCreate()
import spark.implicits._
val jsonFile = spark.read.option("multiLine", true).json("d:/jsons/rules_dimensions_v1.json")
jsonFile.printSchema()
//jsonFile.select("tag").select("name").show()
jsonFile.show()
val flattened = jsonFile.withColumn("tag", explode($"tag"))
flattened.show()
I'm trying to read an RDF/XML file into Apache Spark (Scala 2.11, Apache Spark 1.4.1) using Apache Jena. I wrote this Scala snippet:
val factory = new RdfXmlReaderFactory()
HadoopRdfIORegistry.addReaderFactory(factory)
val conf = new Configuration()
conf.set("rdf.io.input.ignore-bad-tuples", "false")
val data = sc.newAPIHadoopFile(path,
classOf[RdfXmlInputFormat],
classOf[LongWritable], //position
classOf[TripleWritable], //value
conf)
data.take(10).foreach(println)
But it throws an error:
INFO readers.AbstractLineBasedNodeTupleReader: Got split with start 0 and length 21765995 for file with total length of 21765995
15/07/23 01:52:42 ERROR readers.AbstractLineBasedNodeTupleReader: Error parsing whole file, aborting further parsing
org.apache.jena.riot.RiotException: Producer failed to ever call start(), declaring producer dead
at org.apache.jena.riot.lang.PipedRDFIterator.hasNext(PipedRDFIterator.java:272)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:242)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
...
ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Error parsing whole file at position 0, aborting further parsing
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:285)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
The file is fine, because I can parse it locally. What am I missing?
EDIT
Some information to reproduce the behaviour
Imports:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.jena.hadoop.rdf.io.registry.HadoopRdfIORegistry
import org.apache.jena.hadoop.rdf.io.registry.readers.RdfXmlReaderFactory
import org.apache.jena.hadoop.rdf.types.QuadWritable
import org.apache.spark.SparkContext
scalaVersion := "2.11.7"
dependencies:
"org.apache.hadoop" % "hadoop-common" % "2.7.1",
"org.apache.hadoop" % "hadoop-mapreduce-client-common" % "2.7.1",
"org.apache.hadoop" % "hadoop-streaming" % "2.7.1",
"org.apache.spark" % "spark-core_2.11" % "1.4.1",
"com.hp.hpl.jena" % "jena" % "2.6.4",
"org.apache.jena" % "jena-elephas-io" % "0.9.0",
"org.apache.jena" % "jena-elephas-mapreduce" % "0.9.0"
I'm using a sample RDF file from here. It's freely available information about John Peel sessions (more info about the dump).
So it appears your problem came down to manually managing your dependencies.
In my environment I was simply passing the following to my Spark shell:
--packages org.apache.jena:jena-elephas-io:0.9.0
This does all the dependency resolution for you.
If you are building an sbt project, then it should be sufficient to do the following in your build.sbt:
libraryDependencies += "org.apache.jena" % "jena-elephas-io" % "0.9.0"
Thanks all for the discussion in the comments. The problem was really tricky and not clear from the stack trace: the code needs one extra dependency to work, jena-core, and this dependency must be packaged first.
"org.apache.jena" % "jena-core" % "2.13.0"
"com.hp.hpl.jena" % "jena" % "2.6.4"
I use this assembly strategy:
lazy val strategy = assemblyMergeStrategy in assembly <<= (assemblyMergeStrategy in assembly) { (old) => {
    case PathList("META-INF", xs @ _*) =>
      (xs map { _.toLowerCase }) match {
        case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) => MergeStrategy.discard
        case _ => MergeStrategy.discard
      }
    case x => MergeStrategy.first
  }
}
I'm trying to run a very simple example using Rapture.io. I'm not sure what I'm missing here.
scala> import rapture.io._
import rapture.io._
scala> import rapture.core._
import rapture.core._
scala> val x = File / "tmp" / "a.txt"
<console>:20: error: value / is not a member of object java.io.File
val x = File / "tmp" / "a.txt"
^
scala> import java.io.File
import java.io.File
scala> val x = File / "tmp" / "a.txt"
<console>:21: error: value / is not a member of object java.io.File
val x = File / "tmp" / "a.txt"
^
scala>
You need to include the following dependency in build.sbt:
libraryDependencies += "com.propensive" %% "rapture-fs" % "0.9.1"
where the version number (e.g. 0.9.1) should reflect the one currently available, which usually corresponds to the version of rapture-core you're using.
Then, in the source code:
import rapture.fs._
Do not import java.io.File; otherwise it will create ambiguity.
See this link for more info.
https://groups.google.com/forum/#!topic/rapture-users/N3-wIBKuNaA
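Putting it together, a minimal sketch of the corrected REPL session (assuming rapture-fs 0.9.x is on the classpath; the File object here comes from rapture.fs, not java.io):
import rapture.core._
import rapture.io._
import rapture.fs._ // provides the File object with the / path syntax

val x = File / "tmp" / "a.txt"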