How to suppress printing of variable values in zeppelin - scala

Given the following snippet:
val data = sc.parallelize(0 until 10000)
val local = data.collect
println(s"local.size")
Zeppelin prints out the entire value of local to the notebook screen. How may that behavior be changed?

You can also try adding curly braces around your code:
{
  val data = sc.parallelize(0 until 10000)
  val local = data.collect
  println(s"local.size = ${local.size}")
}

Since 0.6.0, Zeppelin provides a boolean flag zeppelin.spark.printREPLOutput in the Spark interpreter configuration (accessible via the GUI), which is set to true by default.
If you set its value to false, you get the desired behaviour: only explicit print statements are output.
See also: https://issues.apache.org/jira/browse/ZEPPELIN-688
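For reference, once changed, the relevant row in the Spark interpreter settings (Interpreter menu in the Zeppelin GUI) looks roughly like this:
zeppelin.spark.printREPLOutput    false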

What I do to avoid this is define a top-level function, and then call it:
def run(): Unit = {
  val data = sc.parallelize(0 until 10000)
  val local = data.collect
  println(local.size)
}
run()

FWIW, this appears to be new behaviour.
Until recently we had been using Livy 0.4, which only output the content of the final statement (rather than echoing the output of the whole script).
When we upgraded to Livy 0.5, the behaviour changed to output the entire script.
While splitting the paragraph and hiding the output does work, it adds unnecessary overhead to the usability of Zeppelin.
For example, if you need to refresh your output, you have to remember to run two paragraphs (i.e. the one that sets up your output and the one containing the actual println).
There are, IMHO, other usability issues with this approach that make Zeppelin less intuitive to use.
Someone has logged a JIRA ticket to address "the problem"; please vote for it:
LIVY-507

Zeppelin, like the spark-shell REPL, always prints the whole interpreter output.
If you really want only the local.size string to be printed, the best way is to put the println statement in a separate paragraph.
Then you can hide all output of the previous paragraph using the small "book" icon at the top-right.
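In practice that looks something like the following (a small sketch of the two-paragraph split):
// Paragraph 1: run it once, then hide its output with the "book" icon
val data = sc.parallelize(0 until 10000)
val local = data.collect()

// Paragraph 2: only this paragraph's output stays visible
println(local.size)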

A simple trick I use is to define
def !() = "_ __ ___ ___________________________________________________"
and then call it as
$bang
above or close to the code I want to check.
It works; the REPL prints
res544: String = _ __ ___ ___________________________________________________
which acts as a visual separator. Afterwards I just leave the call there, commented out ;)
// hope it helps

Related

How to remove header by using filter function in spark?

I want to remove the header from a file. But since the file will be split into partitions, I can't just drop the first item. So I am using a filter function to handle it, and here is the code I am using:
val noHeaderRDD = baseRDD.filter(line=>!line.contains("REPORTDATETIME"));
The error I am getting says "error: not found: value line". What could be the issue with this code?
I don't think anybody pointed out the obvious, that line.contains is also possible:
val noHeaderRDD = baseRDD.filter(line => !(line contains("REPORTDATETIME")))
You were nearly there, just a syntax issue, but that is significant of course!
Using textFile as below:
val rdd = sc.textFile(<<path>>)
rdd.filter(x => !x.startsWith(<<"Header Text">>))
Or
In Spark 2.0:
spark.read.option("header","true").csv("filePath")
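Putting the pieces together, a small sketch (the file path below is a placeholder, not from the question):
// RDD approach: drop the line(s) that look like the header
val rdd = sc.textFile("/path/to/report.csv")
val noHeaderRDD = rdd.filter(line => !line.startsWith("REPORTDATETIME"))

// DataFrame approach (Spark 2.x): let the CSV reader consume the header row
val df = spark.read.option("header", "true").csv("/path/to/report.csv")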

File filter not working in Spark StreamingContext.fileStream(...) API

I'm building a Spark Streaming application where my requirement is to read all pre-existing files in a directory being monitored.
I'm using the StreamingContext.fileStream(...) API for this. The API requires passing a filter function; in my case I always return true from it, since I need to read all the files.
Also, the newFilesOnly flag in StreamingContext.fileStream(...) is set to false.
[Here's the API doc]
But no matter what the filter function returns or how the newFilesOnly flag is set, the RDDs created in the corresponding DStream are empty.
Here's the code snippet:
val ssc = new StreamingContext(sparkConf, Seconds(30))

val filterF = new Function[Path, Boolean] {
  def apply(x: Path): Boolean = {
    println("In File " + x.toString) // Prints existing file paths as expected
    true
  }
}

val strm = ssc.fileStream[LongWritable, Text, TextInputFormat]("s3n://<bucket>/", filterF, false).map(_._2.toString)
strm.print() // DOESN'T PRINT ANYTHING
I have tried different combinations of filter-function return values and the newFilesOnly flag; nothing worked.
If I use StreamingContext.textFileStream(...) instead, it works fine, but reads only new files which is expected behavior from this API.
Am I missing something here? Any help will be appreciated. Thanks in advance!
Solved it by increasing the ignore window of the FileInputDStream. This can be done by changing the spark.streaming.fileStream.minRememberDuration property.
The default value is 1 minute; all the files I tested with had modification times older than 1 minute, so they were ignored.
See the code documentation for more details.
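For reference, a minimal sketch of raising that property when building the streaming context (the app name and the 1-hour value below are just examples):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Files whose modification time falls inside this window are still eligible;
// the default is 60s, which is why older files were silently skipped.
val conf = new SparkConf()
  .setAppName("file-stream-example") // placeholder name
  .set("spark.streaming.fileStream.minRememberDuration", "3600s")

val ssc = new StreamingContext(conf, Seconds(30))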

How do I print the contents of an ApacheSpark RDD in my terminal?

This is my first time using Scala and Apache Spark for a project. I'm trying to print the contents of a matrix when I run my code in the terminal, but nothing I try is working so far.
Instead I only get this printed:
org.apache.spark.mllib.linalg.distributed.MatrixEntry;@71870da7
org.apache.spark.mllib.linalg.distributed.CoordinateMatrix@1dcca8d3
I'm just using println(), but when I use collect(), that doesn't give a good result either.
The default toString prints the name of a class followed by an address in memory.
org.apache.spark.mllib.linalg.distributed.MatrixEntry;@71870da7
You're going to want to find a way to iterate through your matrix and print each element.
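For a CoordinateMatrix specifically (which the output above suggests), one way is to collect its entries RDD and print each MatrixEntry. A small sketch, assuming the matrix is small enough to collect to the driver:
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix

// Collect the (row, column, value) entries to the driver and print them one per line.
def printEntries(matrix: CoordinateMatrix): Unit =
  matrix.entries.collect().foreach { e =>
    println(s"(${e.i}, ${e.j}) -> ${e.value}")
  }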
Building on @zero323's comment (as an aside, would you like to put an answer out there?): given an RDD[SomeType] you can call
rdd.collect()
or
rdd.take(k)
Then you can print out the results using normal toString() methods that depend on the type of the rdd contents. So if SomeType were a List[Double] then the
println(s"${rdd.collect().mkString(",")}")
would give you a single-line comma separated output of the results.
As @zero323 noted, another consideration is: "do you really want to print out the contents of your rdd?" More likely you might only want a summary - such as
println(s"Number of entries in RDD is ${rdd.count()}")
Iterate over the rdd like this (note that on a cluster, foreach(println) prints to the executors' stdout rather than the driver's):
rdd.foreach(println)
scala> val rdd1 = sc.parallelize(List(1, 2, 3, 4)).map(_ * 2)
To print the data within RDD
scala> rdd1.collect().foreach(println)
Output:
2
4
6
8

Job executed with no data in Spark Streaming

My code:
// messages is JavaPairDStream<K, V>
Fun01(messages)
Fun02(messages)
Fun03(messages)
Fun01, Fun02, and Fun03 all have transformations and output operations (foreachRDD).
Fun01 and Fun03 both executed as expected, which proves "messages" is not null or empty.
On the Spark application UI, I found Fun02's output stage under "Spark stages", which proves it executed.
The first line of Fun02 is a map function, and I added logging to it. I also added logging for every step in Fun02; they all show that there is no data.
Does somebody know possible reasons? Thanks very much.
@maasg Fun02's logic is:
msg_02 = messages.mapToPair(...)
msg_03 = msg_02.reduceByKeyAndWindow(...)
msg_04 = msg_03.mapValues(...)
msg_05 = msg_04.reduceByKeyAndWindow(...)
msg_06 = msg_05.filter(...)
msg_07 = msg_06.filter(...)
msg_07.cache()
msg_07.foreachRDD(...)
I have tested on Spark 1.1 and Spark 1.2, which are what my company's Spark cluster supports.
It seems that this is a bug in Spark 1.1 and Spark 1.2, fixed in Spark 1.3.
I post my test result here: http://secfree.github.io/blog/2015/05/08/spark-streaming-reducebykeyandwindow-data-lost.html .
When two reduceByKeyAndWindow calls are used back to back, data loss may appear depending on the window and slide values.
I cannot find the bug in Spark's issue list, so I cannot get the patch.
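For what it's worth, the chained-window pattern looks roughly like this in Scala (the key type, durations, and thresholds below are made up, not from the original job):
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// Two reduceByKeyAndWindow calls chained back to back, as in Fun02 above.
def fun02(messages: DStream[(String, Int)]): Unit = {
  val windowed = messages
    .reduceByKeyAndWindow(_ + _, Seconds(60), Seconds(30))  // first window
    .mapValues(_.toLong)
    .reduceByKeyAndWindow(_ + _, Seconds(300), Seconds(60)) // second window on top of the first
    .filter { case (_, count) => count > 0 }

  windowed.cache()
  windowed.foreachRDD(rdd => println(s"records in batch: ${rdd.count()}"))
}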

Parsing options that take more than one value with scopt in scala

I am using scopt to parse command line arguments in scala. I want it to be able to parse options with more than one value. For instance, the range option, if specified, should take exactly two values.
--range 25 45
Coming from a Python background, I am basically looking for a way to do the following with scopt instead of Python's argparse:
parser.add_argument("--range", default=None, nargs=2, type=float,
                    metavar=('start', 'end'),
                    help="Foo bar start and stop")
I don't think minOccurs and maxOccurs solve my problem exactly, nor does the key:value example in its help.
Looking at the source code, this is not possible. The Read type class used has a member tuplesToRead, but it doesn't seem to be working when you force it to 2 instead of 1. You will have to make a feature request, I guess, or work around this by using --min 25 --max 45, or --range '25 45' with a custom Read instance that splits this string into two parts. As @roterl noted, this is not a standard way of parsing.
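For that last option, a rough sketch of such a custom Read (scopt 3.x; the Config case class with from/to fields is hypothetical, mirroring the answer below):
import scopt.Read

case class Config(from: Int = 0, to: Int = 0) // hypothetical

// Turns a single quoted argument such as "25 45" into a pair of Ints.
implicit val rangeRead: Read[(Int, Int)] = Read.reads { s =>
  val Array(start, end) = s.trim.split("\\s+").map(_.toInt)
  (start, end)
}

val parser = new scopt.OptionParser[Config]("myapp") {
  opt[(Int, Int)]("range")
    .action { (pair, c) => c.copy(from = pair._1, to = pair._2) }
    .text("start and end, passed as one quoted argument: --range \"25 45\"")
}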
It should be OK as long as your values are delimited with something other than a space...
--range 25-45
... although you need to split them manually. Parse it with something like:
opt[String]('r', "range").action { (x, c) =>
val rx = "([0-9]+)\\-([0-9]+)".r
val rx(from, to) = x
c.copy(from = from.toInt, to = to.toInt)
}
// ...
println(s" Got range ${parsedArgs.from}..${parsedArgs.to}")