Debugging SCollection contents when running tests - scala

Is there any way to view the contents of an SCollection when running a unit test (PipelineSpec)?
When running something in production on many machines there would be no way to see the entire collection on one machine, but I wonder whether there is a way to view the contents of an SCollection, for example when running a unit test in debug mode in IntelliJ.

If you want to print debug statements to the console, you can use the debug method, which is part of SCollection. Sample code is shown below:
val stdOutMock = new MockedPrintStream
Console.withOut(stdOutMock) {
  runWithContext { sc =>
    val r = sc.parallelize(1 to 3).debug(prefix = "===")
    r should containInAnyOrder(Seq(1, 2, 3))
  }
}
stdOutMock.message.filterNot(_ == "\n") should contain theSameElementsAs
  Seq("===1", "===2", "===3")

Related

How do I work with a Scala process interactively?

I'm writing a bot in Scala for a game that uses text input and output. So I want to work with a process interactively - that is, my code receives output from the process, works with it, and only then sends its next input to the process. So I want to give a function access to the inputStreams and the outputStream simultaneously.
This doesn't seem to fit into any of the factories in scala.sys.process.BasicIO or the constructor for scala.sys.process.ProcessIO (three functions, each of which has access to only one stream).
Here's how I'm doing it at the moment.
import java.io.{InputStream, OutputStream, PrintWriter}
import java.util.Scanner
import scala.sys.process.{Process, ProcessIO}

private var rogue_input: OutputStream = _
private var rogue_output: InputStream = _
private var rogue_error: InputStream = _

Process("python3 /home/robin/IdeaProjects/Rogomatic/python/rogue.py --rogomatic").run(
  // each callback just captures the corresponding stream in a field
  new ProcessIO(rogue_input = _, rogue_output = _, rogue_error = _)
)
try {
  val rogue_scanner = new Scanner(rogue_output)
  val rogue_writer = new PrintWriter(rogue_input, true)
  // Play the game
} finally {
  rogue_input.close()
  rogue_output.close()
  rogue_error.close()
}
This works, but it doesn't feel very Scala-like. Is there a more idiomatic way to do this?
So I want to work with a process interactively - that is, my code receives output from the process, works with it, and only then sends its next input to the process.
In general, this is traditionally solved by expect. There exist libraries and tools inspired by expect for various languages, including for Scala: https://github.com/Lasering/scala-expect.
The README of the project gives various examples. While I don't know exactly what your rogue.py expects in terms of stdin/stdout interactions, here's a quick "hello world" example showing how you could interact with a Python interpreter (using the Ammonite REPL, which has convenient library-importing capabilities):
import $ivy.`work.martins.simon::scala-expect:6.0.0`

import work.martins.simon.expect.core._
import work.martins.simon.expect.core.actions._
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

val timeout = 5.seconds

val e = new Expect("python3 -i -", defaultValue = "?")(
  new ExpectBlock(
    new StringWhen(">>> ")(
      Sendln("""print("hello, world")""")
    )
  ),
  new ExpectBlock(
    new RegexWhen("""(.*)\n>>> """.r)(
      ReturningWithRegex(_.group(1).toString)
    )
  )
)

e.run(timeout).onComplete(println)
What the code above does is "expect" >>> to be printed to stdout; when it finds that, it sends print("hello, world") followed by a newline. From then on, it reads and returns everything up to the next prompt (>>> ) using a regex.
Amongst other debug information, the above should result in Success(hello, world) being printed to your console.
The library offers various other styles, and other similar libraries may exist. My main point is that an expect-inspired library is likely what you're looking for.

Spark collect()/count() never finishes while show() runs fast

I'm running Spark locally on my Mac and there is a weird issue. Basically, I can output any number of rows using the show() method of the DataFrame; however, when I try to use count() or collect(), even on pretty small amounts of data, Spark gets stuck on that stage and never finishes the job. I'm using Gradle for building and running.
When I run
./gradlew clean run
The program gets stuck at
> Building 83% > :run
What could cause this problem?
Here is the code.
val moviesRatingsDF = MongoSpark.load(sc).toDF().select("movieId", "userId", "rating")

val movieRatingsDF = moviesRatingsDF
  .groupBy("movieId")
  .pivot("userId")
  .max("rating")
  .na.fill(0)

val ratingColumns = movieRatingsDF.columns.drop(1) // drop the name column

val movieRatingsDS: Dataset[MovieRatingsVector] = movieRatingsDF
  .select(col("movieId").as("movie_id"), array(ratingColumns.map(x => col(x)): _*).as("ratings"))
  .as[MovieRatingsVector]

val moviePairs = movieRatingsDS
  .withColumnRenamed("ratings", "ratings1")
  .withColumnRenamed("movie_id", "movie_id1")
  .crossJoin(movieRatingsDS.withColumnRenamed("ratings", "ratings2").withColumnRenamed("movie_id", "movie_id2"))
  .filter(col("movie_id1") < col("movie_id2"))

val movieSimilarities = moviePairs.map(row => {
  val ratings1 = sc.parallelize(row.getAs[Seq[Double]]("ratings1"))
  val ratings2 = sc.parallelize(row.getAs[Seq[Double]]("ratings2"))
  val corr: Double = Statistics.corr(ratings1, ratings2)
  MovieSimilarity(row.getAs[Long]("movie_id1"), row.getAs[Long]("movie_id2"), corr)
}).cache()

val collectedData = movieSimilarities.collect()
println(collectedData.length)
log.warn("I'm done") // never gets here
close
Spark evaluates lazily and only computes an RDD/DataFrame when an action is called. To answer your question:
1. collect() and count() are two different actions. If you are not persisting the data, each action re-evaluates the whole lineage, hence it takes more time than anticipated.
2. show() is a single action, and it only needs to produce a small number of rows (20 by default), hence it finishes quickly.
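As a generic sketch of the persist-before-multiple-actions pattern described above (df here stands for any DataFrame you plan to hit with more than one action):
val cached = df.persist()    // or df.cache(); marks the data for reuse across actions
println(cached.count())      // first action computes the lineage and populates the cache
val rows = cached.collect()  // second action reads the cached data instead of recomputing
cached.unpersist()           // release the cached blocks when finished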

Spark: run an external process in parallel

Is it possible with Spark to "wrap" and run an external process managing its input and output?
The process is a normal C/C++ application that usually runs from the command line. It accepts a plain text file as input and generates another plain text file as output. Since I need to integrate the flow of this application with something bigger (still within Spark), I was wondering if there is a way to do this.
The process can easily be run in parallel (at the moment I use GNU Parallel) by splitting its input into (for example) 10 part files, running 10 instances of it, and re-joining the 10 output part files into one file.
The simplest thing you can do is to write a simple wrapper which takes data from standard input, writes it to a file, executes the external program, and outputs the results to standard output. After that, all you have to do is use the pipe method:
rdd.pipe("your_wrapper")
The only serious consideration is IO performance. If possible, it would be better to adjust the program you want to call so it can read and write data directly without going through disk.
Alternatively, you can use mapPartitions combined with process and standard IO tools to write to a local file, call your program, and read the output, as in the sketch below.
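A minimal sketch of that mapPartitions approach, assuming an RDD[String] named rdd and a hypothetical external command /usr/local/bin/my_tool that takes an input file path and an output file path as its two arguments:
import java.io.File
import java.nio.file.Files
import scala.collection.JavaConverters._
import scala.sys.process._

val processed = rdd.mapPartitions { iter =>
  // write this partition's records to a temporary input file on the executor
  val inFile  = File.createTempFile("part-in-", ".txt")
  val outFile = File.createTempFile("part-out-", ".txt")
  Files.write(inFile.toPath, iter.toSeq.asJava)

  // run the external program; ! blocks until it exits and returns its exit code
  val exitCode = Seq("/usr/local/bin/my_tool", inFile.getAbsolutePath, outFile.getAbsolutePath).!
  require(exitCode == 0, s"external tool failed with exit code $exitCode")

  // read the program's output back in as this partition's records
  val result = Files.readAllLines(outFile.toPath).asScala.toList
  inFile.delete()
  outFile.delete()
  result.iterator
}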
If you end up here based on the question title from a Google search, but you don't have the OP restriction that the external program needs to read from a file--i.e., if your external program can read from stdin--here is a solution. For my use case, I needed to call an external decryption program for each input file.
import org.apache.commons.io.IOUtils
import sys.process._
import scala.collection.mutable.ArrayBuffer

val showSampleRows = true
val bfRdd = sc.binaryFiles("/some/files/*,/more/files/*")
val rdd = bfRdd.flatMap{ case (file, pds) => { // pds is a PortableDataStream
  val rows = new ArrayBuffer[Array[String]]()
  var errors = List[String]()
  val io = new ProcessIO(
    in => { // "in" is an OutputStream; write the encrypted contents of the
            // input file (pds) to this stream
      IOUtils.copy(pds.open(), in) // open() returns a DataInputStream
      in.close
    },
    out => { // "out" is an InputStream; read the decrypted data off this stream.
      // Even though this runs in another thread, we can write to rows, since it
      // is part of the closure for this function
      for (line <- scala.io.Source.fromInputStream(out).getLines) {
        // ...decode line here... for my data, it was pipe-delimited
        rows += line.split('|')
      }
      out.close
    },
    err => { // "err" is an InputStream; read any errors off this stream
      // errors is part of the closure for this function
      errors = scala.io.Source.fromInputStream(err).getLines.toList
      err.close
    }
  )
  val cmd = List("/my/decryption/program", "--decrypt")
  val exitValue = cmd.run(io).exitValue // blocks until subprocess finishes
  println(s"-- Results for file $file:")
  if (exitValue != 0) {
    // TBD write to string accumulator instead, so driver can output errors
    // string accumulator from #zero323: https://stackoverflow.com/a/31496694/215945
    println(s"exit code: $exitValue")
    errors.foreach(println)
  } else {
    // TBD, you'll probably want to move this code to the driver, otherwise
    // unless you're using the shell, you won't see this output
    // because it will be sent to stdout of the executor
    println(s"row count: ${rows.size}")
    if (showSampleRows) {
      println("6 sample rows:")
      rows.slice(0, 6).foreach(row => println("  " + row.mkString("|")))
    }
  }
  rows
}}
scala> :paste "test.scala"
Loading test.scala...
...
rdd: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[62] at flatMap at <console>:294
scala> rdd.count // action, causes Spark code to actually run
-- Results for file hdfs://path/to/encrypted/file1: // this file had errors
exit code: 255
ERROR: Error decrypting
my_decryption_program: Bad header data[0]
-- Results for file hdfs://path/to/encrypted/file2:
row count: 416638
sample rows:
<...first row shown here ...>
...
<...sixth row shown here ...>
...
res43: Long = 843039
References:
https://www.scala-lang.org/api/current/scala/sys/process/ProcessIO.html
https://alvinalexander.com/scala/how-to-use-closures-in-scala-fp-examples#using-closures-with-other-data-types

Launch specific external process in Scala

I've been struggling to launch a specific external process in Scala. It works for most programs, but when I try
Array("gnome-terminal", "--working-directory",
"/mydir", "-x", "bash", "-c", "tmux attach")
it fails. I tried the same using Python's subprocess.Popen and it worked perfectly. Any suggestion?
My code is:
import scala.io.Source
import scala.sys.process._

object Test {
  def main(args: Array[String]): Unit = {
    val source = Source.fromFile("file.txt")
    val lines = source.getLines
    lines.toList.map(raw => {
      val programAndOptions = raw.split('$')
      // here I get the Array mentioned above
      Process(programAndOptions).run()
    })
    source.close
  }
}
Update
My file.txt is something like this:
evince$/path/to/my/pdf
evince$/path/to/other/pdf
nautilus$/path/to/my/working/directory
gnome-terminal$--working-directory$/mydir$-x$bash$-c$tmux attach
Update2
I ran the same code again to test and try some other things and it worked 'as is'.
run is non-blocking, so it executes the program and continues with the rest of the code, which in this case is source.close followed by JVM exit. Exiting the JVM probably kills the launched app.
Try using the ! function instead of run, or do something like:
val p = Process(programAndOptions).run()
p.exitValue // blocks until the process finishes
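For reference, a minimal sketch of the blocking ! alternative, reusing the programAndOptions array from the question:
import scala.sys.process._
val exitCode = Process(programAndOptions).! // runs the program and waits for it to exit, returning its exit code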

Strange issue with SBT, println, and scala console application

When I run my scala code (I'm using SBT), the prompt is displayed after I enter some text as shown here:
C:\... > sbt run
[info] Loading project definition [...]
[info] Set current project to [...]
Running com[...]
test
>>
exit
>> >> >> >> >> >> [success] Total time[...]
It seems like it's stacking up the print() statements and only displaying them when it runs a different command.
If I use println() it works as it should (except that I don't want a newline)
The code:
...
def main(args: Array[String]) {
  var endSession: Boolean = false
  var cmd = ""
  def acceptInput: Any = {
    print(">> ")
    cmd = Console.readLine
    if (cmd != "exit") {
      if (cmd != "") runCommand(cmd)
      acceptInput
    }
  }
  acceptInput
}
...
What's going on here?
Output from print (and println) can be buffered. Scala sends output through a java.io.PrintStream, which only auto-flushes on a newline, and then only if auto-flushing is enabled. It might be OS-dependent, though, since my print appears immediately.
If you add Console.out.flush after each print, you'll empty the buffer to the screen (on any OS), for example:
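Here is a minimal sketch of the adjusted loop, keeping the cmd var and runCommand helper from the question:
def acceptInput: Any = {
  print(">> ")
  Console.out.flush() // push the prompt to the terminal before blocking on input
  cmd = Console.readLine
  if (cmd != "exit") {
    if (cmd != "") runCommand(cmd)
    acceptInput
  }
}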