How to compile a Spark-Cassandra program using Scala? - scala

Lately I've started learning Spark and Cassandra. I know that Spark can be used from Python, Scala, and Java, and I've read the docs here: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md. The thing is, after creating a program named testfile.scala with the code the document shows (I don't know if using .scala is right), I don't know how to compile it. Can anyone guide me on what to do with it?
Here is testfile.scala:
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
val ssc = new StreamingContext(conf, Seconds(n))
val stream = ssc.actorStream[String](Props[SimpleStreamingActor], actorName, StorageLevel.MEMORY_AND_DISK)
val wc = stream.flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ + _).saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)
println(rdd.map(_.getInt("value")).sum)

Scala projects are compiled by scalac, but it's quite low level: you have to set up build paths and manage all dependencies yourself, so most people fall back on a build tool such as sbt, which manages a lot of that for you. The other two commonly used build tools are Maven, which is favored by Java old-schoolers, and Gradle, which is more down to earth.
> how to import spark-cassandra-connector
I've set up an example project. Basically, you define all of your dependencies in build.sbt or its analog; here is how the dependency on spark-cassandra-connector is defined (line #12).
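If you don't want to dig through the linked project, a bare-bones build.sbt looks roughly like this (a sketch only: the Scala, Spark and connector versions below are illustrative assumptions and must line up with the versions you actually run):
name := "spark-cassandra-test"

version := "0.1"

scalaVersion := "2.10.5"  // illustrative; use the Scala version your Spark build expects

libraryDependencies ++= Seq(
  "org.apache.spark"   %% "spark-core"                % "1.4.1",  // version is an assumption
  "org.apache.spark"   %% "spark-streaming"           % "1.4.1",  // needed for the streaming part
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.4.0"   // version is an assumption
)
With that in place, sbt compile builds the project and sbt package produces a JAR you can hand to spark-submit.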
> And, is it a rule that we have to code with class or object
Yes and no. If you compile with sbt, all your code files have to be wrapped in an object (or class), but sbt also lets you code in its shell, and code you type there doesn't need to be wrapped (same rules as the ordinary Scala REPL). Next, both IDEA and Eclipse have worksheet capabilities, so you can create a test.sc and draft your code there.
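For illustration, here is a rough sketch of how the batch part of your testfile.scala could be wrapped in an object with a main method so that sbt can compile it (the object name is made up; the streaming part would additionally need a StreamingContext and a real actor class, which I leave out):
import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical wrapper object: sbt compiles top-level objects/classes, not bare statements.
object TestFile {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)

    val rdd = sc.cassandraTable("test", "kv")
    println(rdd.count)
    println(rdd.first)
    println(rdd.map(_.getInt("value")).sum)

    sc.stop()
  }
}
From there, sbt package produces a JAR that you can run with spark-submit (e.g. --class TestFile).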

Related

How to reuse Ammonite REPL's .sc files in an sbt project?

I have some reusable Ammonite REPL .sc files that were used in some Jupyter Scala notebooks.
Now I am creating a standalone application built with sbt, and I hope I can reuse these existing .sc files in the sbt project.
Is it possible to share these .sc files between Jupyter Scala/Ammonite REPL and sbt projects? How can Scala sources and .sc files be compiled together?
I created Import.scala, a Scala compiler plugin that enables magic imports.
With the help of Import.scala, code snippets in a .sc file can be loaded into a Scala source file in an sbt project with the same syntax as Ammonite or Jupyter Scala.
Given a MyScript.sc file:
// MyScript.sc
val elite = 31337
Magic-import it in another file:
import $file.MyScript
It works:
assert(MyScript.elite == 31337)

Spark - "sbt package" - "value $ is not a member of StringContext" - Missing Scala plugin?

When running "sbt package" from the command line for a small Spark Scala application, I'm getting the "value $ is not a member of StringContext" compilation error on the following line of code:
val joined = ordered.join(empLogins, $"login" === $"username", "inner")
  .orderBy($"count".desc)
  .select("login", "count")
IntelliJ 13.1 gives me the same error message. The same .scala source compiles without any issue in Eclipse 4.4.2, and it also builds fine with Maven from the command line in a separate Maven project.
It looks like sbt doesn't recognize the $ sign because I'm missing some plugin in my project/plugins.sbt file or some setting in my build.sbt file.
Are you familiar with this issue? Any pointers will be appreciated. I can provide build.sbt and/or project/plugins.sbt if need be.
You need to make sure you import sqlContext.implicits._
This gives you implicit class StringToColumn extends AnyRef,
which is documented as:
Converts $"col name" into a Column.
In Spark 2.0+
the $-notation for columns can be used by importing the implicits from the SparkSession object (spark):
val spark = org.apache.spark.sql.SparkSession.builder
  .master("local")
  .appName("App name")
  .getOrCreate()
import spark.implicits._
Then your code with the $ notation works:
val joined = ordered.join(empLogins, $"login" === $"username", "inner")
  .orderBy($"count".desc)
  .select("login", "count")
Great answers. If resolving the import is a concern, then this will also work:
import org.apache.spark.sql.{SparkSession, SQLContext}
import org.apache.spark.sql.functions.not  // not(...) lives in org.apache.spark.sql.functions

val ss = SparkSession.builder().appName("test").getOrCreate()
val dataDf = ...
import ss.sqlContext.implicits._
dataDf.filter(not($"column_name1" === "condition"))

Running Spark Application from Eclipse

I am trying to develop a Spark application in Eclipse and then debug it by stepping through it.
I downloaded the Spark source code and added some of the Spark sub-projects (such as spark-core) to Eclipse. I have already installed the ScalaIDE in Eclipse, and I created a simple application based on the example given on the Spark website.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
To my project, I added the spark-core project as a dependent project (right click -> build path -> add project). Now I am trying to build and run my application. However, my project shows that it has errors, but I don't see any errors listed in the Problems view in Eclipse, nor do I see any lines highlighted in red, so I am not sure what the problem is. My assumption is that I need to add external JARs to my project, but I am not sure which JARs those would be. The error is caused by val conf = new SparkConf().setAppName("Simple Application") and the subsequent lines; when I remove those lines, the error goes away. I would appreciate any help and guidance, thanks!
It seems you are not using any package/dependency manager (e.g. sbt, Maven), which would eliminate most versioning issues. It can be challenging to get the correct versions of Java, Scala, Spark and all of their transitive dependencies right on your own.
I strongly recommend converting your project into a Maven project:
Convert Existing Eclipse Project to Maven Project
Personally, I have had very good experiences with sbt in IntelliJ IDEA (https://confluence.jetbrains.com/display/IntelliJIDEA/Getting+Started+with+SBT), which is easy to set up and maintain.
I've just created a Maven archetype for Spark the other day.
It sets up a new Spark 1.3.0 project in Eclipse/IDEA with Scala 2.10.4.
Just follow the instructions here.
You'll just have to change the Scala version after the project is generated:
Right-click on the generated project and select:
Scala > Set the Scala Installation > Fixed 2.10.5 (bundled)
The default version that comes with ScalaIDE (currently 2.11.6) is automatically added to the project by ScalaIDE when it detects scala-maven-plugin in the pom.
I'd appreciate feedback if someone knows how to set the Scala library container version from Maven while it bootstraps a new project. Where does ScalaIDE look up the Scala version, if anywhere?
BTW, just make sure you download the sources (project right-click > Maven > Download sources) before stepping into Spark code in the debugger.
If you want to use the (IMHO very best) Eclipse goodies (References, Type Hierarchy, Call Hierarchy), you'll have to build Spark yourself so that all the sources are on your build path (Maven Scala dependencies are not processed by the Eclipse IDE/JDT, even though they are, of course, on the build path).
Have fun debugging, I can tell you that it helped me tremendously to get deeper into Spark and really understand how it works :)
You could try adding the spark-assembly.jar instead.
As others have noted, the better way is to use sbt (or Maven) to manage your dependencies: spark-core has many dependencies of its own, and adding just that one JAR won't be enough. A sketch of such a build file follows.
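For illustration only, a bare-bones build.sbt for this SimpleApp might look roughly like this (the Scala and Spark versions are assumptions and should match the Spark build you debug against):
name := "simple-app"

version := "0.1"

scalaVersion := "2.10.4"  // illustrative; match the Scala version of your Spark build

// spark-core pulls in its many transitive dependencies automatically;
// mark it % "provided" only when you build an assembly for spark-submit
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"
Plugins such as sbteclipse can then generate the Eclipse project metadata from this build, so the IDE and the command line agree on the classpath.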
You haven't specified the master in your Spark code. Since you're running it on your local machine, replace the following line
val conf = new SparkConf().setAppName("Simple Application")
with
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
Here "local[2]" means 2 threads will be used.

"scala.runtime in compiler mirror not found" but working when started with -Xbootclasspath/p:scala-library.jar

I'm trying to run a Scala application packaged as a JAR (including dependencies), but it fails unless the Scala library is added using the -Xbootclasspath/p option.
Failing invocation:
java -jar /path/to/target/scala-2.10/application-assembly-1.0.jar
After the application has produced some of its intended output, the console shows:
Exception in thread "main"
scala.reflect.internal.MissingRequirementError: object scala.runtime
in compiler mirror not found.
at scala.reflect.internal.MissingRequirementError$.signal(MissingRequirementError.scala:16)
at scala.reflect.internal.MissingRequirementError$.notFound(MissingRequirementError.scala:17)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:48)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:40)
at scala.reflect.internal.Mirrors$RootsBase.getModuleOrClass(Mirrors.scala:61)
at scala.reflect.internal.Mirrors$RootsBase.getPackage(Mirrors.scala:172)
at scala.reflect.internal.Mirrors$RootsBase.getRequiredPackage(Mirrors.scala:175)
at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage$lzycompute(Definitions.scala:181)
at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackage(Definitions.scala:181)
at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass$lzycompute(Definitions.scala:182)
at scala.reflect.internal.Definitions$DefinitionsClass.RuntimePackageClass(Definitions.scala:182)
at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr$lzycompute(Definitions.scala:1015)
at scala.reflect.internal.Definitions$DefinitionsClass.AnnotationDefaultAttr(Definitions.scala:1014)
at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses$lzycompute(Definitions.scala:1144)
at scala.reflect.internal.Definitions$DefinitionsClass.syntheticCoreClasses(Definitions.scala:1143)
at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode$lzycompute(Definitions.scala:1187)
at scala.reflect.internal.Definitions$DefinitionsClass.symbolsNotPresentInBytecode(Definitions.scala:1187)
at scala.reflect.internal.Definitions$DefinitionsClass.init(Definitions.scala:1252)
at scala.tools.nsc.Global$Run.<init>(Global.scala:1290)
at extract.ScalaExtractor$Compiler$2$.<init>(ScalaExtractor.scala:24)
Working invocation:
java -Xbootclasspath/p:/path/to/home/.sbt/boot/scala-2.10.2/lib/scala-library.jar -jar /path/to/target/scala-2.10/application-assembly-1.0.jar
The strange thing is that application-assembly-1.0.jar was built to include all dependencies, including the Scala library. When the JAR file is extracted, one can verify that the class files in the scala.runtime package are indeed included.
Creation of the JAR file
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.9.1") was added to project/plugins.sbt and the assembly target was invoked. A JAR file of about 25MB results.
Building the JAR with proguard shows the same runtime behavior as seen with assembly's JAR file.
Application code that triggers the MissingRequirementError
Some application code works fine; the previously described exception is triggered as soon as the new Run in the following fragment executes.
import scala.reflect.internal.util.BatchSourceFile
import scala.reflect.io.AbstractFile
import scala.reflect.io.Path.jfile2path
import scala.tools.nsc.Global
import scala.tools.nsc.Settings
…
import scala.tools.nsc._
object Compiler extends Global(new Settings()) {
  new Run // This is line 24 from the stack trace!

  def parse(path: File) = {
    val code = AbstractFile.getFile(path)
    val bfs = new BatchSourceFile(code, code.toCharArray)
    val parser = new syntaxAnalyzer.UnitParser(new CompilationUnit(bfs))
    parser.smartParse()
  }
}
val ast = Compiler.parse(file)
Among others, scala-library, scala-compiler and scala-reflect are defined as dependencies in build.sbt.
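For reference, these dependencies would typically be declared in build.sbt along these lines (a sketch; the version is taken from the scala-2.10.2 path above and is otherwise an assumption):
scalaVersion := "2.10.2"

libraryDependencies ++= Seq(
  "org.scala-lang" % "scala-library"  % scalaVersion.value,
  "org.scala-lang" % "scala-compiler" % scalaVersion.value,
  "org.scala-lang" % "scala-reflect"  % scalaVersion.value
)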
For the curious / background information
The aim of the application is to aid in the localization of Java and Scala programs. The task of the code fragment above is to get an AST from a Scala file in order to find method calls in it.
The questions
Given that the Scala library is included in the JAR file, why is it necessary to invoke the JAR with -Xbootclasspath/p:scala-library.jar?
Why do other parts of the application run just fine even though scala.runtime is reported as missing later?
The easy way to configure the settings with familiar keystrokes:
import scala.tools.nsc.Global
import scala.tools.nsc.Settings
def main(args: Array[String]) {
  val s = new Settings
  s processArgumentString "-usejavacp"
  val g = new Global(s)
  val r = new g.Run
}
That works for your scenario.
Even easier:
java -Dscala.usejavacp=true -jar ./scall.jar
Bonus info, I happened to come across the enabling commit message:
Went ahead and implemented classpaths as described in email to
scala-internals on the theory that at this point I must know what I'm
doing.
** PUBLIC SERVICE ANNOUNCEMENT **
If your code of whatever kind stopped working with this commit (most
likely the error is something like "object scala not found") you can
get it working again with either of:
passing -usejavacp on the command line
set system property "scala.usejavacp" to "true"
Either of these will alert scala that you want the java application
classpath to be utilized by scala as well.
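The system-property route can also be taken programmatically. A rough sketch (my own illustration, not from the answer above; the object name is made up): set scala.usejavacp before constructing Settings and Global, which has the same effect as passing -Dscala.usejavacp=true on the java command line.
import scala.tools.nsc.{Global, Settings}

object EmbeddedCompiler {                    // hypothetical wrapper object
  def main(args: Array[String]): Unit = {
    sys.props("scala.usejavacp") = "true"    // same effect as java -Dscala.usejavacp=true
    val settings = new Settings
    val global = new Global(settings)
    new global.Run                           // no more MissingRequirementError for scala.runtime
  }
}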