scala spark paritions: understanding the output

scala spark paritions: understanding the output - scala

I am running spark and scala. What is the meaning of the line that i get when i run rawblocks.partitions.length? My linkage folder had 10 files.
what does res1 and Int stand for?
Also is there a place where i
can find official documentation for spark methods? For example i
want to see details of textFile.
spark version 1.6.1
Using Scala version 2.10.5 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_65)
scala> val rawblocks=sc.textFile("linkage")
rawblocks: org.apache.spark.rdd.RDD[String] = linkage MapPartitionsRDD[3] at textFile at <console>:27
scala> rawblocks.partitions.length
res1: Int = 10

The res1 and Int are not special to Spark: res1 is a name given in Scala REPL (shell) to unnamed values - results are numerated (starting from zero), for example:
scala> 10
res0: Int = 10
scala> "hello"
res1: String = hello
This should also give you a clue about Int - it's the inferred type of this value (Scala's Int is somewhat equivalent to Java's Integer).
Spark API: here's the documentation for the two primary entry points of Spark-core: SparkContext, RDD

Related

scala implicit par makes no progress

This is my first attempt with using Scala's parallelism. I have huge data (that can be stored as any collection) which I would like to parallelize on a multi-core system using a simple map operation (such as val out = data.par.map(foo(_)). The example I saw given at scala docs has the following snippet that gave weird output. The 'serial' versions seems to run with implicit parallelism and the parallel versions are not working. Any pointers towards a solution would be very much appreciated.
scala> val list = (1 to 1000000).toList
list: List[Int] = List(1, 2, 3, 4, 5,... // used > 1000% cpu
scala> val out = list.map(_ + 42) // again used > 1000% cpu
out: List[Int] = List(43, 44, 45, 46,
scala> val out = list.par.map(_ + 42) // process stalls, consumes no cpu!
scala> (1 to 10) map println // initially used >400% cpu
1
2
3
4
5
6
7
8
9
10
scala> (1 to 10).par map println // process stalls, consumes no cpu!
I am using Scala 2.12.10 (OpenJDK 64-Bit Server VM, Java 1.8.0_275)
Edit: The above code ran in a script but not in scala shell. Probably some limitation of the shell itself.

Try with -Yrepl-class-based
$ scala-runners/scala --scala-version 2.12.10 -Yrepl-class-based
Welcome to Scala 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 10.0.2).
Type in expressions for evaluation. Or try :help.
scala> (1 to 10).par map println
6
9
8
1
10
7
2
3
4
5
res0: scala.collection.parallel.immutable.ParSeq[Unit] = ParVector((), (), (), (), (), (), (), (), (), ())

Spark Shell allowing to redeclare the same immutable variable [duplicate]

This question already has answers here:
Why is it possible to declare variable with same name in the REPL?
(2 answers)
Closed 3 years ago.
I am working with Spark-shell for Scala and found a strange behaviour in Spark-shell REPL which is not there if i use any IDE.
I can declare the same immutable variable again and again in REPL, but the same is not allowed in IDE.
Here is the code in REPL:
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[5] at textFile at <console>:24
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[7] at textFile at <console>:24
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[9] at textFile at <console>:24
scala> val rdd = sc.textFile("README.md")
rdd: org.apache.spark.rdd.RDD[String] = README.md MapPartitionsRDD[11] at textFile at <console>:24
And here is the same thing I am trying in Eclipse IDE, and it shows compile time error:
Is there anything i am missing any configuration for Spark-shell REPL?
Or is it the expected behaviour?

In your REPL, your code is actually translated as follows:
object SomeName {
val rdd = sc.textFile("README.md")
}
object Some_Other_Name {
val rdd = sc.textFile("README.md")
}
Since both your rdd vals are defined in separate Singletons, there is no name collision between them. and since this happens behind the scenes in the REPL, you feel as if you are making reassignments to the same val.
In an IDE, all our code is written inside Classes or Scala singletons(Objects), so in the same scope (inside the resp. Class/Object) you are returned an error.

Using vim editor to create a Scala script within REPL

Version of Scala I am using is Scala 2.12.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_121) and the Jline Library present is 2.14.3 .
This could sound silly but I am trying to figure out the problem when trying to create a scala file using editor cmd line vi or vim during the Scala REPL mode its throwing error. Below is my error .. Could you please let me know if there is any specific Scala Terminal console that I am suppose to use or am I doing something wrong ?
scala> vi test1.scala
<console>:1: error: ';' expected but '.' found.
vi test1.scala
I am able to do a VI and VIM as well in my system without the SCALA REPL mode but when I am in REPL I am not able to create a scala script file and execute it . What could be wrong ? Is there any settings that needs to be enabled for this ?

For saving your REPL history, use :save file.
There is limited support for using an external editor. The result of the edit is run immediately. After a reset, only the edited lines are in session history, so save will save only those lines.
$ EDITOR=gedit scala
Welcome to Scala 2.12.3 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_111).
Type in expressions for evaluation. Or try :help.
scala> val x = 42
x: Int = 42
scala> println(x)
42
scala> :edit -2
+val x = 17
+println(x)
17
x: Int = 17
scala> :hi 3
1896 val x = 17
1897 println(x)
1898 :hi 3
scala> :reset
Resetting interpreter state.
Forgetting this session history:
val x = 42
println(x)
val x = 17
println(x)
Forgetting all expression results and named terms: $intp, x
scala> :ed 1896+2
+val x = 5
+println(x)
5
x: Int = 5
scala> :save sc.sc
scala> :load sc.sc
Loading sc.sc...
x: Int = 5
5

Get objects out of scala interpreter

I spent hours investigating one topic. I am definitely out of my depth here. What I want is to run the scala interpreter programmatically and be able to extract object values from the interpreter. for example, if I send
val a = 1
val b = a + 1
I want to be able to read out b as an Int, not just a string printed out like
b = 2
The source code is dense. So far I don't see any part which would allow such an extraction. Any experts here who can give me a tip or tell me this is utter nonsense?
How do I get typed objects out of the scala interpreter between sessions?

Use JSR 223.
Welcome to Scala version 2.11.7 [...]
scala> import javax.script._
import javax.script._
scala> val engine = (new ScriptEngineManager).getEngineByName("scala")
engine: javax.script.ScriptEngine = scala.tools.nsc.interpreter.IMain#4233e892
scala> engine.eval("val a = 1")
res0: Object = 1
scala> engine.eval("val b = a + 1")
res1: Object = 2
scala> engine.eval("b").asInstanceOf[Int]
res2: Int = 2

How to clear all variables in Scala REPL

Is there a quick command? I don't want to Ctrl+d and run Scala everytime I want to clear all variables. reset, clear and clean don't work and :help doesn't have anything listed

You can use :reset
Welcome to Scala version 2.10.0-RC2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.6.0_37).
Type in expressions to have them evaluated.
Type :help for more information.
scala> val a = 1
a: Int = 1
scala> val b = 3
b: Int = 3
scala> :reset
Resetting interpreter state.
Forgetting this session history:
val a = 1
val b = 3
Forgetting all expression results and named terms: $intp, a, b
scala>