How to reuse Ammonite REPL's .sc files in an sbt project?

I have some reusable Ammonite REPL .sc files that I used in some Jupyter Scala notebooks.
Now I am creating a standalone application built with sbt, and I hope I can reuse these existing .sc files in the sbt project.
Is it possible to share these .sc files between Jupyter Scala/Ammonite REPL and sbt projects? How can I make Scala sources and .sc files compile together?

I created Import.scala, a Scala compiler plugin that enables magic imports.
With the help of Import.scala, code snippets in a .sc file can be loaded into a Scala source file in an sbt project with the same syntax as Ammonite or Jupyter Scala.
Given a MyScript.sc file:
// MyScript.sc
val elite = 31337
Magic-import it in another Scala file:
import $file.MyScript
It works.
assert(MyScript.elite == 31337)
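For completeness, here is a minimal sketch of the sbt wiring; the plugin coordinates below are an assumption of mine, so check the Import.scala README for the actual group ID and version before copying them:
// build.sbt -- assumed coordinates, verify against the Import.scala README
addCompilerPlugin("com.thoughtworks.import" %% "import" % "latest.release")
With the plugin on the compiler plugin classpath, the import $file.MyScript line above compiles as part of a normal sbt build.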

Related

How do I make packages available to the Scala REPL?

I'm trying to get familiar with Scala. I am using macOS.
I've installed Scala using brew install scala, which was hassle-free, and once it completed I could launch the Scala REPL simply by issuing scala, which puts me at the scala> prompt.
I now want to import some packages, so I try:
import org.apache.spark.sql.Column
and unsurprisingly it fails with
error: object apache is not a member of package org
This makes sense; how would it know where to get that package from? The thing is, I don't know what I need to do to make that package available. Is there anything I can do from the command line that would allow me to import org.apache.spark.sql.Column?
I have googled around a little but not found anything that explains this in a jargon-free way. I'm a complete Scala noob, so jargon-free responses would be appreciated.
Here are two ways to start a REPL with dependencies that I'm aware of:
Use SBT to manage dependencies, use console to start a REPL with those dependencies
Use Ammonite REPL
You could create a separate directory with a build.sbt where you set
scalaVersion := "2.11.12"
and then copy the
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
snippet from MavenCentral. Then you can run the REPL with sbt console. Note that this will create project and target subdirectories, so it "leaves traces" and you can't use it like the standalone Scala REPL. You could also omit the build.sbt and add the library dependencies by typing them into the sbt shell itself.
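For the no-build.sbt variant, the commands typed into the sbt shell would look roughly like this (reusing the versions quoted above):
set scalaVersion := "2.11.12"
set libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
console
Once the console starts, import org.apache.spark.sql.Column should resolve.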
Alternatively you can just use Ammonite REPL that has been created exactly for that purpose.
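In Ammonite the equivalent is a magic import typed straight into the REPL, roughly like this (same artifact and version as above):
import $ivy.`org.apache.spark::spark-sql:2.3.0`
import org.apache.spark.sql.Column
Ammonite resolves the dependency from Maven Central on the fly, so there is no project directory left behind.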
You can use the classpath to make the library available, i.e. download the jar locally and use the command as follows (here I'm using the Apache Commons IO library to move files from the Scala prompt):
C0:Desktop pvangala$ scala -cp /Users/pvangala/Downloads/commons-io-2.6/commons-io-2.6.jar
Welcome to Scala 2.12.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_161).
Type in expressions for evaluation. Or try :help.
scala> import java.io.File
import java.io.File
scala> val src = new File("/Users/pvangala/Downloads/commons-io-2.6-bin.tar")
src: java.io.File = /Users/pvangala/Downloads/commons-io-2.6-bin.tar
scala> val dst = new File("/Users/pvangala/Desktop")
dst: java.io.File = /Users/pvangala/Desktop
scala> org.apache.commons.io.FileUtils.moveFileToDirectory(src, dst, true)
If you want to use Spark, I'd recommend the spark-shell that comes with the Spark installation. I don't know macOS, so I can't help you much with installing Spark there.
For plain Scala I recommend Ammonite (http://ammonite.io/#Ammonite-REPL), which has built-in syntax for pulling in packages/dependencies.
If you want to use Spark, you should use the spark-shell instead of the Scala REPL. It has almost the same behaviour but includes all the Spark dependencies by default.
You should download the Spark binaries from here.
Then, if you are using Linux, you should create the environment variable SPARK_HOME pointing to the downloaded folder and include its bin folder in PATH.
Then you can start it in any console with the command spark-shell.
On Windows I'm not sure, but I think there should be a spark-shell.cmd file or something similar which you can use to start the spark-shell.
I did the following in Windows:
for /f "tokens=*" %%a in ('java -jar coursier fetch -p "com.lihaoyi::requests:0.2.0" "com.lihaoyi::upickle:0.7.5"') do set SCP=%%a
scala -nc -classpath %SCP% %1 %2 %3
Instead of the two libraries listed here you can use any number of other libraries you need. They must be available in Maven Central, though. %1 could be a Scala script (".sc" extension), but you could also leave it empty and the REPL will start with the libraries on the classpath.

Eclipse (set up with Scala environment): object apache is not a member of package org

As shown in the image, it's giving an error when I import the Spark packages. Please help. When I hover there, it shows "object apache is not a member of package org".
I searched for this error; it suggests the Spark jars have not been imported, so I imported "spark-assembly-1.4.1-hadoop2.2.0.jar" too, but I still get the same error. Below is what I actually want to run:
import org.apache.spark.{SparkConf, SparkContext}

object ABC {
  def main(args: Array[String]) {
    // Scala main method
    println("Spark Configuration")
    val conf = new SparkConf()
    conf.setAppName("My First Spark Scala Application")
    conf.setMaster("spark://ip-10-237-224-94:7077")
    println("Creating Spark Context")
  }
}
Adding the spark-core jar to your classpath should resolve your issue. Also, if you are using a build tool like Maven or Gradle (if not, you should, because spark-core has many dependencies of its own and you would keep hitting this problem with different jars), use the Eclipse task provided by these tools to properly set the classpath in your project.
I was also receiving the same error; in my case it was a compatibility issue. Spark 2.2.1 is not compatible with Scala 2.12 (it is compatible with 2.11.8), and my IDE was using Scala 2.12.3.
I resolved my error by:
1) Importing the jar files from Spark's base folder. The Spark installation directory (e.g. the Spark folder on the C drive) contains a jars folder, which holds all the basic jar files.
Go to Eclipse, right-click the project -> Properties -> Java Build Path. Under the 'Libraries' tab select Add External JARs..., import all the jar files from the jars folder and click Apply.
2) Again go to Properties -> Scala Compiler -> Scala Installation -> Latest 2.11 bundle (dynamic)*
*Before selecting this option one should check the compatibility of the Spark and Scala versions.
The problem is that Scala is NOT backward compatible across major versions. Hence each Spark module is compiled against a specific Scala library. When we run from Eclipse, there is one Scala version that was used to compile and create the Spark dependency jar we add to the build path, and a SECOND Scala version acting as the Eclipse runtime environment. The two may conflict.
This is a hard reality, although we might wish Scala were backward compatible, or at least that a compiled jar file were.
Hence the recommendation: use Maven or similar, where the dependency versions can be managed.
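If sbt is an acceptable alternative to Maven, a minimal sketch that pins the compatible pair of versions mentioned above would be (versions taken from this answer; adjust to your installation):
// build.sbt -- keep the Scala version in line with the one Spark was built against
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.2.1"
The %% operator appends the Scala binary version to the artifact name, so the resolved jar matches the project's Scala version.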
If you are doing this in the context of Scala within a Jupyter Notebook, you'll get this error. You have to install the Apache Toree kernel:
https://github.com/apache/incubator-toree
and create your notebooks with that kernel.
You also have to start the Jupyter Notebook with:
pyspark

Running Spark Application from Eclipse

I am trying to develop a spark application on Eclipse, and then debug it by stepping through it.
I downloaded the Spark source code and added some of the Spark sub-projects (such as spark-core) to Eclipse. Now I am trying to develop a Spark application using Eclipse. I have already installed ScalaIDE in Eclipse. I created a simple application based on the example given on the Spark website.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf

object SimpleApp {
  def main(args: Array[String]) {
    val logFile = "YOUR_SPARK_HOME/README.md" // Should be some file on your system
    val conf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(conf)
    val logData = sc.textFile(logFile, 2).cache()
    val numAs = logData.filter(line => line.contains("a")).count()
    val numBs = logData.filter(line => line.contains("b")).count()
    println("Lines with a: %s, Lines with b: %s".format(numAs, numBs))
  }
}
To my project, I added the spark-core project as a dependent project (right click -> build path -> add project). Now I am trying to build my application and run it. However, my project shows that it has errors, but I don't see any errors listed in the Problems view within Eclipse, nor do I see any lines highlighted in red, so I am not sure what the problem is. My assumption is that I need to add external jars to my project, but I am not sure which jars those would be. The error is caused by val conf = new SparkConf().setAppName("Simple Application") and the subsequent lines. When I tried removing those lines, the error went away. I would appreciate any help and guidance, thanks!
It seems you are not using any package/dependency manager (e.g. sbt, Maven), which would eliminate most versioning issues. It can be challenging to set the correct versions of Java, Scala, Spark and all their transitive dependencies on your own.
I strongly recommend converting your project to Maven:
Convert Existing Eclipse Project to Maven Project
Personally, I have very good experiences with sbt on IntelliJ IDEA (https://confluence.jetbrains.com/display/IntelliJIDEA/Getting+Started+with+SBT) which is easy to set up and maintain.
I've just created a Maven archetype for Spark the other day.
It sets up a new Spark 1.3.0 project in Eclipse/Idea with Scala 2.10.4.
Just follow the instructions here.
You'll just have to change the Scala version after the project is generated:
Right click on the generated project and select:
Scala > Set the Scala Installation > Fixed 2.10.5.(bundled)
The default version that comes with ScalaIDE (currently 2.11.6) is automatically added to the project by ScalaIDE when it detects scala-maven-plugin in the pom.
I'd appreciate feedback if someone knows how to set the Scala library container version from Maven while it bootstraps a new project. Where does ScalaIDE look up the Scala version, if anywhere?
BTW Just make sure you download sources (Project right-click > Maven > Download sources) before stepping into Spark code in debugger.
If you want to use (IMHO the very best) Eclipse goodies (References, Type hierarchy, Call hierarchy) you'll have to build Spark yourself, so that all the sources are on your build path (as Maven Scala dependencies are not processed by EclipseIDE/JDT, even though they are, of course, on the build path).
Have fun debugging, I can tell you that it helped me tremendously to get deeper into Spark and really understand how it works :)
You could try to add the spark-assembly.jar instead.
As others have noted, the better way is to use sbt (or Maven) to manage your dependencies. spark-core has many dependencies itself, and adding just that one jar won't be enough.
You haven't specified the master in your Spark code, and you're running it on your local machine. Replace the following line
val conf = new SparkConf().setAppName("Simple Application")
with
val conf = new SparkConf().setAppName("Simple Application").setMaster("local[2]")
Here "local[2]" means 2 threads will be used.

How to compile a spark-cassandra program using Scala?

Lately I started learning Spark and Cassandra. I know that we can use Spark with Python, Scala and Java, and I've read the docs on this page: https://github.com/datastax/spark-cassandra-connector/blob/master/doc/0_quick_start.md. The thing is, after I create a program named testfile.scala with the code the document shows (I don't know if using .scala is right), I don't know how to compile it. Can anyone guide me on what to do with it?
Here is testfile.scala:
import akka.actor.Props
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import com.datastax.spark.connector._
import com.datastax.spark.connector.streaming._
// SimpleStreamingActor, actorName and n are placeholders from the quick-start doc
val conf = new SparkConf(true).set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext("spark://127.0.0.1:7077", "test", conf)
val ssc = new StreamingContext(conf, Seconds(n))
val stream = ssc.actorStream[String](Props[SimpleStreamingActor], actorName, StorageLevel.MEMORY_AND_DISK)
val wc = stream.flatMap(_.split("\\s+")).map(x => (x, 1)).reduceByKey(_ + _).saveToCassandra("streaming_test", "words", SomeColumns("word", "count"))
val rdd = sc.cassandraTable("test", "kv")
println(rdd.count)
println(rdd.first)
println(rdd.map(_.getInt("value")).sum)
Scala projects are compiled by scalac, but it's quite low level: you have to set up build paths and manage all dependencies yourself, so most people fall back to a build tool such as sbt, which manages a lot of that for you. The other two commonly used build tools are Maven, which is favored by Java old-schoolers, and Gradle, which is more down to earth.
> how to import spark-cassandra-connector
I've set up an example project. Basically, you define all of your dependencies in build.sbt or its analog; here is how the dependency on spark-cassandra-connector is defined (line #12).
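For illustration, a minimal build.sbt along those lines might look like the sketch below; the version numbers are placeholders of mine, so check them against the connector's compatibility table:
// build.sbt -- illustrative versions only, see the spark-cassandra-connector compatibility table
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.6.1",
  "org.apache.spark" %% "spark-streaming" % "1.6.1",
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.6.0"
)
With that in place, sbt compile builds the project and sbt package produces a jar you can submit with spark-submit.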
> And, is it a rule that we have to code with class or object
Yes and no. If you build with sbt, all code in your source files has to be wrapped in an object (or class), but sbt also lets you code in its shell, and code you enter there does not need to be wrapped (same rules as in the ordinary Scala REPL). Also, both IDEA and Eclipse have worksheet capabilities, so you can create test.sc and draft your code there.

creating and using standalone scalaz jar without sbt

I've downloaded a scalaz snapshot from the repository (version 6.0.4).
I want to create a standalone jar file and put it into my Scala lib directory so I can use scalaz without sbt.
I have the Scala package from scala-lang.org, stored in /opt/scala.
So far I did the following:
go to the untarred scalaz directory
run sbt from the scalaz project
compile the scalaz project
make a package (with the package command)
sbt makes a jar: full/target/scala-2.9.1/scalaz-full_2.9.1-6.0.4-SNAPSHOT.jar
it also produces another jar: full/lib/sxr_2.9.0-0.2.7.jar
I moved both jars to /opt/scala/lib
After this I tried the Scala REPL and I can't import scalaz. I tried import scalaz._, Scalaz._, org.scalaz._, scalaz-core._ and none of them work.
REPL code completion after typing import scalaz suggests: scalaz_2.9.1-6.0.4-SNAPSHOT.
But import scalaz_2.9.1-6.0.4-SNAPSHOT._ doesn't work.
Any idea?
You can download scalaz and extract the archive that contains scalaz-core_2.9.1-6.0.3.jar, or download scalaz-core directly.
Then you can use scala -cp scalaz-core_2.9.1-6.0.3.jar to launch the REPL and finally import scalaz._ as expected.
If you want to use the jar produced by sbt, you can find it in core/target/scala-2.9.1/scalaz-core_2.9.1-6.0.4-SNAPSHOT.jar (you will also find source and javadoc packages in the same directory). Just put this file in your classpath (using scala -cp for example) and you will be able to import scalaz._
I think I know the problem.
scalaz-full_2.9.1-6.0.4-SNAPSHOT.jar is not a jar of compiled classes; it's just a zip of the scalaz project, so it does not contain a package-like directory tree (e.g. some directory names contain '.').
So to use it we need to unpack scalaz-full_2.9.1-6.0.4-SNAPSHOT.jar and copy the desired jars (e.g. scalaz-core_2.9.1-6.0.4-SNAPSHOT.jar, scalaz-http_2.9.1-6.0.4-SNAPSHOT.jar ...) to the lib directory.